Your 10,000 hours in Data Science | A Walk down Data-Lane

The way from data novice to professional.

Clock with reverse numeral, Jewish Townhall Clock, Prague- by Richard Michael (all rights reserved).

There is a popular idea that practicing something for over 10,000 hours (ten thousand hours) gives you enough proficiency in a subject. The concept is based on the book Outliers by M. Gladwell: the 10k hours are the time you spend practicing or studying a subject until you have a firm grasp of it and can be called proficient. Though this number of hours is somewhat arbitrary, we will take a look at how those hours can be spent to gain proficiency in the field of Data Science.
Imagine this as a learning budget in your Data-apprenticeship journey. If I were to start from scratch, this is how I would spend those 10 thousand hours to become a proficient Data Scientist.

The Core modules

Mathematics and Statistics: Basic (frequentist) statistics, Bayesian statistics, Intro to Data Analysis, some Probability Theory, some basic Analysis and Calculus.
A Data Scientist's main job is to provide insights into data from problem domains, so a firm grasp of the underlying mathematics is essential. You can find a lot of excellent content in available university courses and Coursera courses. This is a good way to get started on your journey and build a rudimentary understanding of how statistics works. Fundamental courses are essential, but please also challenge yourself with the real deal.
Let's give that 2500h (312 days of 8h work).
Analysis/Stats Languages and modules: R, Python (pandas), Julia, SAS.
This is your bread and butter and can be considered part of your stats education. Here it is about learning by doing. Reading an O'Reilly book on R will only get you so far. Load up a curated data set or a Kaggle challenge and work through some problems.
Statistical languages lie on a spectrum. If you are close to scientific computing you might consider Julia; at the other end of the spectrum, focused purely on statistical analysis, sits SAS. Some people argue that R can do both.
1000h (125 days of 8h work)

Multi-purpose programming languages: Python, Go, Java, Bash, Perl, C++,… .
This very much depends on the systems you face on a daily basis. If you are just starting out, pick a language and stick to it. Once you have learned the concepts you will pick up other languages more easily.
I myself rely heavily on a combination of Python and Bash for my daily work. Other tasks require a thorough understanding of good old Java or even Perl to get started.
2000h (250 days of 8h work).

Database Technologies: {T-, My-, U-}SQL, PostgreSQL, MongoDB, … .
Relational and non-relational databases are some of the systems you will have to work with in a production environment. It is useful to know how your data is stored, how queries run under the hood, and how to reverse a transaction. Later on your data sources might vary a lot, and a basic understanding goes a long way.
750h (94 days of 8h work)

Operating Systems: Windows, Linux, MacOS.
Whatever your work environment is: master it! Know the ins and outs. You might need to install some really weird library this one time to solve a specific problem (true story). Knowing where things are and why they are there goes a long way. Know what an SSH connection is and how to run an analysis on a different system. Running analyses somewhere other than your local machine is going to happen at some point.
500h (62 days of 8h work).

ETL/ELT Systems: This is a mixture of programming languages, database technologies and operating systems. Frameworks like Spark, Hadoop and Hive offer advanced means for data storage and analysis on cluster-computing platforms. Some companies may rely on different tech stacks, like Microsoft Azure, Google Cloud or AWS solutions.
This goes hand in hand with Database Technologies and might already require a firm understanding of higher-level programming languages (like Java or Python). There are also beginner-friendly systems, like KNIME, to get your feet wet.
400h (50 days of 8h work)

Your Problem Cases: This may be Fin-Tech, Bio-Tech, Business Analytics
You should be familiar with the problem domain in which you are working. Is it research and development that you are doing? Are you visualizing business processes and customer behavior? Different fields require different insights. Your data has to make sense to you, and you should be able to see whether a model output is in a valid range or totally off. Spend an absolute minimum of 350h in your problem domain. We suggest more. A lot more.
You can decide if you want to be the method jack-of-all-trades or the expert in your field.
> 350h (44 days of 8h)

The attentive reader sees that this only adds up to 7500 hours so far. We have a basis now, and you might want to go in a certain direction from here.

Different Data-trades – Your personal direction

Back in the olden days, the dark unenlightened ages, we had guilds. The field of working with data also has its guilds: types of tradesmen that solve different problems, and there have been several articles on the subject.
Here is how those trades differ in their requirements:

  1. The Data Scientist: You do stats and analysis, you roll out solutions and deploy platforms that answer questions and provide insights. Neat. Up your statistics game: +200h. Add a bit of ML, which we call statistical learning, +100h, and some multi-purpose language like Python, Perl or Bash for your day job.
  2. The Data Engineer: You care about the data. You build the systems that enable the Data Scientist to work effectively and give the analysts their daily bread. Your focus lies on the underlying systems: +500h on systems, +200h on ETL/ELT systems, +200h on DBs and +200h on programming languages. You can even afford to be less heavily involved in the statistics part (-500h on pure stats and theory).
  3. The Machine Learning Engineer: You implement the models that make the magic happen.
    You need all the statistical insights and also proficiency with ML on top of that:
    +300h on ML, +200h on stats. You should also be well versed in a high-level programming language, which makes implementing ML models possible in the first place.
  4. The Analyst: You take data from the systems and make it pretty and reportable for management. You need to know what matters in your problem domain: +200h in problem cases, +100h in DB systems. You need your SQL daily. There is a big variety of analyst jobs out there, so inform yourself and keep learning what you find necessary.
Fig. 1 – Time and experience of a Data Scientist compared to an Analyst.
Fig. 2 – Time and experience of an ML-Engineer compared to a Data-Engineer.
Math/Stats = Mathematics and Statistics, includes your work with stat-languages like R.
Systems = Setting up and maintaining systems (e.g. ETL/ELT)
DBs = Database Systems (relational, non-relational)
SWE = Software Engineering (Python, Perl, Java, Scala, Go, JavaScript, …)
Exp. Field = Experience in the problem domain. Hours spent on the topic.

The figures above illustrate exemplary careers and how much time and effort can be placed in each domain. They are suggestions and vary from individual to individual. For example, the Data Engineer in our example could also be called a Software Engineer who handles a lot of data, while another engineer might need more experience with databases than with programming languages.

As with everything in life, the lines are blurred. Sometimes a Data Scientist also does the work of an ML Engineer and vice versa. Sometimes a Data Scientist also has to build analytics dashboards. Stranger things have happened. Keep in mind that the 10k budget is your entry into the field and gives you a direction, not your specialization. So you have all the freedom in the world to eventually do that PhD in Parapsychology that you've always wanted.

How much work is ten thousand hours?

10k hours is a long time. If you were to do it all in one year, that would be roughly 27h per day, which is not feasible for the average mortal.
Three years of around 9h of work per day is better suited.
If you’re studying on the side and look for a change in your job at some point in the future, then you might be set after a couple more years. The important part is to be consistent with it. A couple of quality hours a day over a few years will get you a long way.

Personal Note

My journey started in October 2016, coming from systems administration and starting a new undergraduate program. Now that it is July 2020, that makes around 1360 days of gaining experience in the field. That puts me at over 7500h (more than 5.5h every day), and I still continue to study and learn a lot every day. My undergraduate program in Cognitive Science helped me get a head start in computer science, applied mathematics, Bayesian statistics, DB systems, formal logic, Machine Learning and much more. This, together with working part-time as a Data Engineer, helped me gain a lot of insights really quickly.

The Key element

I don't care whether you do your 10 thousand hours at uni, in school, at seminars or at home.
Nobody cares. You have to show your work. A degree, if done right, can show this – but it doesn’t have to. It is about gaining experience in the field by solving problems and gaining proficiency with concepts.
Solve a problem and put it in a portfolio, your resume or simply your github-repository.
Whatever works best for you. People hire for your problem-solving skills, not for the A that you got on that one project two years ago. You might say that it was a really cool project altogether.
That's great! If you made a dent in the universe you should tell people about it anyway.

The journey is a long one and if you enjoy what you do it is worthwhile. You will learn a lot along the way.

A short Disclaimer: Always keep in mind that there are different fields of Data Science related work out in the wild.
Every job has its specific purpose and requires a different tool-set.
Every employer might look for different sets of skills.

Belief Networks

or Bayesian Belief Networks combine probability and graph theory to represent probabilistic models in a structured way.

Read: 12 min
Goals:

  • translate graphical models to do inference
  • read and interpret graphical models
  • identify dependencies and causal links
  • work through an example

Intro

Belief Networks (BNs) combine probabilities and their graphical representation to show causal relationships and assumptions about the parameters, e.g. independence between parameters. This is done through nodes and directed links (edges) between the nodes, which form a Directed Acyclic Graph (DAG). This also lets us display operations like conditioning on parameters or marginalizing over parameters.

A DAG is a graph with directed edges
that does not contain any cycles.
Fig 1. A basic representation of a Belief Network as a graph.

Use and Properties

In Fig. 1 we can see a model with four parameters, let's say rain (R), Jake (J), Thomas (T) and sprinkler (S). We observe that R has an effect on the parameters T and J. We can make a distinction between cause and effect, e.g. the rain is causal to Jake's grass being wet.
Further, we can model or observe constraints that our model holds, so that we do not have to account for all combinations of possible parameter values (2^N) and can reduce the space and computational costs.
This graph can represent different independence assumptions through the directions of the edges (arrows).
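
As a minimal sketch (plain Python; the node names simply mirror Fig. 1 and the helper function is my own), the structure of such a network is nothing more than a map from each node to its parents, and a small check confirms that the graph is acyclic:

    # Parent map for the Fig. 1 model: R -> J, R -> T, S -> T.
    parents = {
        "R": [],          # rain has no parents
        "S": [],          # sprinkler has no parents
        "J": ["R"],       # Jake's grass depends on rain
        "T": ["R", "S"],  # Thomas's grass depends on rain and sprinkler
    }

    def is_dag(parents):
        """True if the parent map contains no directed cycle (Kahn-style peeling)."""
        remaining = {node: set(ps) for node, ps in parents.items()}
        while remaining:
            roots = [n for n, ps in remaining.items() if not ps]
            if not roots:
                return False  # every remaining node still waits on another one: a cycle
            for n in roots:
                del remaining[n]
            for ps in remaining.values():
                ps.difference_update(roots)
        return True

    print(is_dag(parents))  # True -> a valid DAG, i.e. a possible Belief Network structure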

Conditional Independence

We know that a joint probability can always be factorized as p(x_1,x_2,x_3)=p(x_{i_1}\mid x_{i_2},x_{i_3})\,p(x_{i_2}\mid x_{i_3})\,p(x_{i_3}) for any ordering (i_1,i_2,i_3) of the indices. This gives six possible permutations, so that simply drawing a graph does not work on its own.
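
Written out, these are the six chain-rule orderings (a standard expansion, nothing model-specific):

    \begin{align*} p(x_1,x_2,x_3) &= p(x_1\mid x_2,x_3)\,p(x_2\mid x_3)\,p(x_3) = p(x_1\mid x_2,x_3)\,p(x_3\mid x_2)\,p(x_2)\\ &= p(x_2\mid x_1,x_3)\,p(x_1\mid x_3)\,p(x_3) = p(x_2\mid x_1,x_3)\,p(x_3\mid x_1)\,p(x_1)\\ &= p(x_3\mid x_1,x_2)\,p(x_1\mid x_2)\,p(x_2) = p(x_3\mid x_1,x_2)\,p(x_2\mid x_1)\,p(x_1) \end{align*}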

Remember that there should not be any cycles in the graph. Therefore we have to drop at least two links, which leaves 4 possible DAGs out of the 6 permutations.
Now, when two edges point to one and the same node we get a collision:

Fig. 2 a) Collisions and induced dependence
Fig. 2b) Model conditioned on T

In case a collision occurs we get either a d-separation or a d-connection. For example, in Fig. 2 we see that X and Y are independent of each other and are both 'causes' of the effect Z. We can also write p(X,Y,Z)=p(Z|X, Y)p(X)p(Y).
However, if we condition the model on the collision node (see Fig. 2b), we get p(X,Y|T)\neq p(X|T)p(Y|T), and X and Y are not independent anymore.
A connection is introduced in the right-hand model of Fig. 2a, where X and Y are now conditionally dependent given Z.
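
A minimal numeric sketch of this induced dependence, using a toy collider X → Z ← Y with two fair coins and Z = X OR Y (my own example, not the exact model in the figure):

    from itertools import product

    # X and Y are independent fair coins, Z is the collider Z = X OR Y.
    def joint(x, y, z):
        return 0.25 if z == (x or y) else 0.0

    # Marginally, X and Y are independent: p(X=1, Y=1) == p(X=1) * p(Y=1).
    p_x1_y1 = sum(joint(1, 1, z) for z in (0, 1))                     # 0.25
    p_x1 = sum(joint(1, y, z) for y, z in product((0, 1), repeat=2))  # 0.5
    p_y1 = sum(joint(x, 1, z) for x, z in product((0, 1), repeat=2))  # 0.5
    print(p_x1_y1, p_x1 * p_y1)  # 0.25 0.25 -> independent

    # Conditioning on the collider Z=1 induces a dependence between X and Y.
    p_z1 = sum(joint(x, y, 1) for x, y in product((0, 1), repeat=2))  # 0.75
    p_x1_y1_given_z1 = joint(1, 1, 1) / p_z1                          # 1/3
    p_x1_given_z1 = sum(joint(1, y, 1) for y in (0, 1)) / p_z1        # 2/3
    p_y1_given_z1 = sum(joint(x, 1, 1) for x in (0, 1)) / p_z1        # 2/3
    print(p_x1_y1_given_z1, p_x1_given_z1 * p_y1_given_z1)            # 0.333... != 0.444...

Marginally X and Y are independent, but once the common effect Z is observed, learning X changes what we believe about Y.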

Belief Networks encode conditional independence well; they do not encode dependence well.

Definition: Two nodes \mathcal{X} and \mathcal{Y} are d-separated by \mathcal{Z} in a graph G if and only if they are not d-connected.
Pretty straightforward, right?

A more applicable explanation is: for every variable x in \mathcal{X} and y in \mathcal{Y}, trace every path U between x and y; if all paths are blocked, then the two nodes are d-separated. For the definition of blocking and of descendants of nodes, you can find more in this paper.

Remember: X is independent of Y if p(X,Y)=p(X)p(Y), and conditionally independent of Y given Z if p(X,Y|Z)=p(X|Z)p(Y|Z).

Now away from the theory to something more practical:

How to interact with a model

  1. Analyze the problem and/or set of parameters
    1. set a graphical structure for the parameters in a problem with a DAG
    2. OR reconstruct the problem setting from the parameters
  2. Compile a table of all required conditional probabilities – p(x_i | pa(x_i)), where pa(x_i) are the parents of x_i
  3. Set and specify the parental relations (see the sketch below)
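
As a sketch of these three steps for the Fig. 1 model (plain Python; the probabilities are the CPT values from the table in the example below), the structure is a parent map plus one conditional table per node, and a quick check confirms every parent configuration is covered:

    from itertools import product

    # Step 1: graphical structure as a parent map (a DAG).
    parents = {"R": (), "S": (), "J": ("R",), "T": ("R", "S")}

    # Steps 2 and 3: one CPT per node, keyed by the parent configuration.
    # Each entry stores p(node = 1 | parents); p(node = 0 | parents) is the complement.
    cpt = {
        "R": {(): 0.2},
        "S": {(): 0.1},
        "J": {(1,): 1.0, (0,): 0.2},                                 # p(J=1 | R)
        "T": {(1, 0): 1.0, (1, 1): 1.0, (0, 1): 0.9, (0, 0): 0.0},   # p(T=1 | R, S)
    }

    # Sanity check: every combination of parent values must appear exactly once.
    for node, pa in parents.items():
        assert set(cpt[node]) == set(product((0, 1), repeat=len(pa))), node
    print("all CPTs fully specified")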

Example

Recap Fig. 1 – our short example case

Thomas and Jake are neighbors. When Thomas gets out of the house in the morning he observes that the grass is wet (effect). We want to know how likely it is that Thomas forgot to turn off the sprinkler (S). Knowing that it had rained the day before explains away that the grass is wet in the morning. So let us look at the probabilities:

We have 4 boolean (yes/no) variables (see Fig. 1):
R (rain, {0,1}), S (sprinkler, {0,1}), T (Thomas's grass, {0,1}), J (Jake's grass, {0,1}). We can then express, for example, that Thomas's grass is wet, given that Jake's grass is wet, the sprinkler was off and it rained, as p(T=1 | J=1, S=0, R=1).
Even in such a small example we have a lot of possible states, namely 2^N=2^4=16 values. We already know that our model has constraints, e.g. the sprinkler is independent of the rain, and when it rains both Thomas's and Jake's grass gets wet.

After taking into account our constraints we can factorize our joint probability into
p(T, J, R, S) = p(T|R,S)p(J|R)p(R)p(S) – we say: "The joint probability is computed as the probability that Thomas's grass is wet given rain and sprinkler, multiplied by the probability that Jake's grass is wet given that it rained, the probability that it rained, and the probability that the sprinkler was on." We have now reduced our problem to 4+2+1+1=8 values. Neat!
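
Counting the stored values under each factor makes the saving explicit (the under-braces are just annotation):

    \begin{align*} p(T,J,R,S) = \underbrace{p(T\mid R,S)}_{4}\,\underbrace{p(J\mid R)}_{2}\,\underbrace{p(R)}_{1}\,\underbrace{p(S)}_{1} \quad\Rightarrow\quad 4+2+1+1=8 \text{ values instead of } 2^4=16. \end{align*}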

We use factorization of joint probabilities to reduce the total number of required states and their (conditional) probabilities.

p(X)                 value
p(R=1)               0.2
p(S=1)               0.1
p(J=1 | R=1)         1.0
p(J=1 | R=0)*        0.2
p(T=1 | R=1, S=0)    1.0
p(T=1 | R=1, S=1)    1.0
p(T=1 | R=0, S=1)    0.9
p(T=1 | R=0, S=0)    0.0
Our conditional probability table (CPT)
* sometimes Jake’s grass is simply wet – we don’t know why…

We are still interested in whether Thomas left the sprinkler on. Therefore we compute the posterior probability p(S=1|T=1) = 0.3382 (see below).

    \begin{align*} p(S=1|T=1) &= \frac{p(S=1,T=1)}{p(T=1)}\\ &=\frac{\sum_{J,R}p(T=1, J, R, S=1)}{\sum_{J,R,S}p(T=1, J, R, S)}\\ &=\frac{\sum_R p(T=1|R,S=1)p(R)p(S=1)}{\sum_{R,S}p(T=1|R,S)p(R)p(S)}\\ &=\frac{0.9*0.8*0.1+1*0.2*0.1}{0.9*0.8*0.1+1*0.2*0.1+0+1*0.2*0.9}\\ &= 0.3382 \end{align*}

When we compute the posterior given that Jake’s grass is also wet we get
p(S=1|T=1, J=1) = 0.1604

    \begin{align*} p(S=1|T=1, J=1) &= \frac{p(S=1,T=1, J=1)}{p(T=1, J=1)}\\ &=\frac{\sum_R p(J=1|R)p(T=1|R,S=1)p(R)p(S=1)}{\sum_{R,S}p(J=1|R)p(T=1|R,S)p(R)p(S)}\\ &=\frac{0.2*0.9*0.8*0.1+1*1*0.2*0.1}{0.2*0.9*0.8*0.1+1*1*0.2*0.1+0+1*1*0.2*0.9}\\ &= \frac{0.0344}{0.2144}=0.1604 \end{align*}

We have shown that it is less likely that Thomas's sprinkler was on when Jake's grass is also wet: Jake's wet grass is extra evidence for rain, which explains away the observed effect.
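
To double-check the arithmetic, a brute-force enumeration of all 2^4 joint states in plain Python (a sketch; the function and variable names are my own) reproduces both posteriors:

    from itertools import product

    # CPT values from the table above, stored as p(variable = 1 | parents).
    p_R, p_S = 0.2, 0.1
    p_J_given_R = {1: 1.0, 0: 0.2}
    p_T_given_RS = {(1, 0): 1.0, (1, 1): 1.0, (0, 1): 0.9, (0, 0): 0.0}

    def bern(p_one, value):
        """Probability of a binary `value` when p(value = 1) = p_one."""
        return p_one if value == 1 else 1.0 - p_one

    def joint(t, j, r, s):
        """Factorized joint p(T,J,R,S) = p(T|R,S) p(J|R) p(R) p(S)."""
        return (bern(p_T_given_RS[(r, s)], t) * bern(p_J_given_R[r], j)
                * bern(p_R, r) * bern(p_S, s))

    def posterior(query, evidence):
        """p(query | evidence) by summing the joint over all consistent states."""
        num = den = 0.0
        for t, j, r, s in product((0, 1), repeat=4):
            state = {"T": t, "J": j, "R": r, "S": s}
            if any(state[k] != v for k, v in evidence.items()):
                continue
            p = joint(t, j, r, s)
            den += p
            if all(state[k] == v for k, v in query.items()):
                num += p
        return num / den

    print(posterior({"S": 1}, {"T": 1}))            # 0.3382...
    print(posterior({"S": 1}, {"T": 1, "J": 1}))    # 0.1604...

The second call confirms that the extra evidence J=1 pulls the posterior for the sprinkler down, exactly as derived above.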


Uncertainties

Fig. 3 – Uncertainties and unreliable evidence

If an observed outcome holds uncertainty, or we do not trust the source of our values, we can account for that. Let's assume the sprinkler malfunctions sometimes, so that instead of a hard 0/1 value we get a vector S=[0.2, 0.8], also called soft evidence. We denote this with dashed nodes (see Fig. 3).
Let's also assume Jake's house is quite far away and there is added unreliability in the evidence that his grass is wet – maybe Jake is a notorious liar about the wetness of his soil. We denote this with a dotted edge (see Fig. 3).
Accounting for such uncertainties and unreliability is a whole different topic, which we will address in another post.

Limitations

Some dependency statements cannot be represented structurally in this way; for example marginalized graphs.
One famous example of BNs running into problems with causal relations is Simpson's paradox. This can be resolved with atomic intervention, which is a topic for yet another post.

Summary

  • (Bayesian) Belief Networks represent probabilistic models as well as the factorization of distributions into conditional probabilities
  • BNs are directed acyclic graphs (DAGs)
  • We can reduce the amount of computation and space required by taking into account constraints from the model which are expressed structurally
  • Conditional independence is expressed as the absence of a link in the network

References

  1. D. Barber, Bayesian Reasoning and Machine Learning. Cambridge University Press. USA. 2012: pp.29-51
  2. G. Pipa, Neuroinformatics Lecture Script. 2015: pp.19-26