The way from data novice to professional.

There exists the idea that practicing something for over 10000 h (ten-thousand-hours) lets you acquire enough proficiency with the subject. The concept is based on the book Outliers by M. Gladwell. The mentioned 10k hours are how much time you spend practicing or studying a subject until you have a firm grasp and can be called proficient. Though this amount of hours is somewhat arbitrary, we will take a look on how those many hours can be spent to gain proficiency in the field of Data Science.
Imagine this as a learning budget in your Data-apprenticeship journey. If I were to start from scratch, this is how I would spend those 10 thousand hours to become a proficient Data Scientist.
The Core modules
Mathematics and Statistics: Basic (frequentist) statistics, Bayesian statistics, Intro to Data Analysis, some Probability Theory, some basic Analysis and Calculus.
A Data Scientists main job is to provide insights into data from problem domains. To have a firm grasp of the underlying mathematics is essential. You can find a lot of excellent content from available university courses and Coursera courses. This is a good way to get started on your journey and get a rudimentary understanding of how statistics work. Though, fundamental courses are essential, please challenge yourself with the real deal.
Lets give that 2500h (312 days of 8h work).
Analysis/Stats Languages and modules: R, Python (pandas), Julia, SAS.
This is the bread-and-butter for you and can be considered a part of your stats-education. Here it is about learning by doing. Reading a O’Reilly book on R will only get you so far. Load up a curated data-set, some kaggle challenge and work through some problems.
There is a spectrum on which statistical languages lie. So if you’re close to scientific computing you might consider Julia. The other end of the spectrum with just statistical analysis is SAS. Some people argue that R can do both.
1000h (125 days of 8h work)
Multi-purpose programming languages: Python, Go, Java, Bash, Perl, C++,… .
This very much depends on the systems that you are facing on a daily basis. If you just start out, pick a language and stick to it. Once you learn the concept you will pick up different languages easier.
I for myself rely heavily on a combination of Python and Bash for my daily work. Other tasks require a thorough understanding of the good old Java or even Perl to get started.
2000h (250 days of 8h work).
Database Technologies: {T-, My-, U-}SQL, PostgreSQL, MongoDB, … .
Relational or non-relational databases are some of the systems that you will have to work with in a production environment. It is useful to know, how your data is stored, how queries run under the hood, how to reverse a transaction. For later work your data-sources might vary a lot and it is good to have a basic understanding.
750h (94 days of 8h work)
Operating Systems: Windows, Linux, MacOS.
Whatever your work-environment is: master it! Know the ins-and outs. You might need to install some really weird library this one time to solve a specific problem (true story). Knowing where things are and why they are there goes a long way. Know what an SSH connection is and how to run analysis on a different system. Running analysis not on your local machine is going to happen at some point in the future.
500h (62 days of 8h work) .
ETL/ELT Systems: This is a mixture between programming languages, database technologies and operating systems. Frameworks like Spark, Hadoop, Hive offer a advanced means for data-storage and also analysis on cluster computing platforms. Some companies may rely on a different tech-stacks like Microsoft Azure Systems, Google Cloud or AWS solutions.
This goes hand in hand with Database Technologies and might already require a firm understanding of higher level programming languages (like Java or Python). There are also beginner systems, like KNIME to get your feet wet .
400 (50 days of 8h)
Your Problem Cases: This may be Fin-Tech, Bio-Tech, Business Analytics
You should be familiar with the problem domain in which you are working. Is it research and development that you are doing? Are you visualizing business processes and customer behavior? Different field require different insights. Your data has to make sense to you and you should be able to see if a model-output is in a valid range or totally off. Spend a very absolute minimum of 350h in your problem domain. We suggest more. A lot more.
You can decide if you want to be the method jack-of-all-trades or the expert in your field.
> 350h (44 days of 8h)
The attentive reader sees that this only adds up to 7500 hours so far. We have a basis now and you might want to go into a certain direction from here.
Different Data-trades – Your personal direction
Back in the olden days, the dark-unenlightened ages, we had guilds. The field of working with data also has different guilds. Types of tradesmen that solve different problems and there have been different articles on the subject.
Here is how those trades differ in their requirements:
- The Data Scientist: You do stats and analysis, you roll-out solutions and deploy platforms that answer questions and provide insights. Neat. Up your statistics game: + 200h. Maybe a bit of ML which we call statistical learning +100h and some type of multi-purpose-language like Python, Perl or Bash for your dayjob.
- The Data Engineer: You care about the data. You build the systems that enable the Data Scientist to work effectively and give the analysts their daily bread. Your focus lies on the underlying systems. +500h on Systems, +200h on ETL/ELT systems, +200h on DBs and programming languages +200h. You can even afford to not be so heavily involved in the statistics part (-500h on pure stats and theory).
- The Machine Learning Engineer: You implement the models that make the magic happen.
You need all the statistical insights and also proficiency with all the ML on top of things –
+300h on ML, + 200h on Stats. You should also be well versed in a high level programming language which makes implementing ML-models possible in the first place. - The Analyst: You take data from the systems, make it pretty and reportable for management. You need to know what matters in your problem domain +200h in problem cases, +100h in DB systems. You need your SQL daily. Though there is a big variety of analyst jobs out there. So inform yourself and continue learning what you see is necessary.


Math/Stats = Mathematics and Statistics, includes your work with stat-languages like R.
Systems = Setting up and maintaining systems (e.g. ETL/ELT)
DBs = Database Systems (relational, non-relational)
SWE = Software Engineering (Python, Perl, Java, Scala, Go, JavaScript, …)
Exp. Field = Experience in the problem domain. Hours spent on the topic.
The Figures above illustrate exemplary careers and how much time and effort can be placed in each domain. The figures make suggestions and vary from individual to individual. For example the Data-Engineer in our example could also be called a Software Engineer that handles a lot of data and another engineer would need more experience working with databases compared to programming languages.
As with everything in life the lines are blurred. Sometimes a Data Scientist is also does the work of a ML Engineer and vice versa. Sometimes a Data Scientist also has to do Analytics Dashboards. Stranger things have happened in the past. Keep in mind that the 10k budget is your entry and gives you a direction into the field and not your specialization. So you have all the freedom in the world to eventually do that PhD in Parapsychology that you’ve always wanted.
How much work is ten thousand hours
10k hours is a long time. if you were to work for one year that would be 27h per day and not so feasible for the average mortal.
So better 3 years of work for around 9 h might be better suited.
If you’re studying on the side and look for a change in your job at some point in the future, then you might be set after a couple more years. The important part is to be consistent with it. A couple of quality hours a day over a few years will get you a long way.
Personal Note
My journey started in October 2016 coming from systems administration and starting a new undergraduate program. Now that it is July 2020 it has been around 1360 days of gaining experience in the field. Therefore I am in the field of >7500 h (with over 5.5h every day) and still continue to study and learn a lot every day. My undergraduate program in Cognitive Science helped me get a head-start in the fields of computer science, applied mathematics, Bayesian statistics, DB systems, formal logic, Machine Learning courses and much more. This together with working as a Data Engineer part-time helped me gain a lot of insights really quickly.
The Key element
I don’t care if your do your 10 thousand hours at Uni, in school, at seminars or at home.
Nobody cares. You have to show your work. A degree, if done right, can show this – but it doesn’t have to. It is about gaining experience in the field by solving problems and gaining proficiency with concepts.
Solve a problem and put it in a portfolio, your resume or simply your github-repository.
Whatever works best for you. People hire for your problem solving skills. Not for the A that you got on that one project 2 years ago. You might say that it was a really cool project altogether.
That’s great! if you made a dent in the universe you should tell people about it anyways.
The journey is a long one and if you enjoy what you do it is worthwhile. You will learn a lot along the way.
A short Disclaimer: Always keep in mind that there are different fields of Data Science related work out in the wild.
Every job has its specific purpose and requires a different tool-sets.
Every employer might look for different sets of skills.