What to Learn to Become a Data Scientist in 2021
When I started learning data science a few years ago, most job ads listed a PhD, or at the very least a master’s, in maths, statistics or a similar subject as an essential requirement.
Over the last couple of years, things have evolved. With the development of machine learning libraries that abstract away much of the complexity behind the algorithms, and the realisation that practically applying machine learning to solve business problems requires a set of skills not usually acquired through academic study alone, companies are now hiring data scientists based on their ability to perform applied data science rather than research.
Applied data science that delivers value to a business in the fastest possible time requires a very practical skillset. As more companies migrate their data and machine learning solutions to the cloud, it is also becoming paramount for data scientists to understand the tools and technologies involved.
Additionally, I believe that the days of a data scientist working solely on data modelling, using data pulled together by data engineers and then handing the model over to a team of software engineers to put into production, are largely behind us, particularly outside of tech giants such as Amazon, Facebook and Google. In most other companies, either the resource isn’t available in those teams or the priorities are not aligned at the right time.
“There is a saying, ‘A jack of all trades and a master of none.’ When it comes to being a data scientist you need to be a bit like this, but perhaps a better saying would be, ‘A jack of all trades and a master of some.’”, Brendan Tierney, Principal Consultant at Oralytics.
In order to deliver maximum value to a business, a data scientist needs to be able to work across the full model development life cycle, with at least a working knowledge of developing data pipelines, performing data analysis, machine learning, maths, statistics, data engineering, cloud computing and software engineering. This means that as we move into 2021, the generalist data scientist is the preferred hire for most businesses.
“The bigger the picture, the more unique the potential human contribution. Our greatest strength is the exact opposite of narrow specialization. It is the ability to integrate broadly.”, David Epstein, Range: Why Generalists Triumph in a Specialized World.
This article doesn’t cover absolutely everything you need to be a data scientist in 2021. Instead, it covers the key skills, both new and old, that I expect to be most essential for a successful data scientist in the near future.
1. Python 3
There are still some cases where data scientists may use R, but generally speaking, if you are doing applied data science these days, Python is going to be the most valuable programming language to learn.
Python 3 (the latest major version) is now firmly the default version of the language for most applications: Python 2 reached its official end of life on 1st January 2020, and the majority of libraries have dropped support for it. If you are learning Python for data science now, it is important to choose a course that works with this version.
You will need a good understanding of the basic syntax of the language and how to write functions, loops and modules. Be familiar with both object-oriented and functional programming in Python, and be able to develop, execute and debug programs.
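To make those building blocks concrete, here is a minimal sketch that puts a function, a loop, a small class and a functional-style pipeline side by side. The names (`mean`, `Measurement`, `readings`) are made up for illustration and don't come from any particular course or library.

```python
from dataclasses import dataclass
from functools import reduce


def mean(values):
    """Return the arithmetic mean of a non-empty list of numbers."""
    total = 0.0
    for value in values:  # a plain loop over the input
        total += value
    return total / len(values)


@dataclass
class Measurement:
    """Object-oriented style: data bundled with the behaviour that uses it."""

    name: str
    readings: list

    def average(self):
        return mean(self.readings)


# Functional style: build the result by composing filter, map and reduce.
readings = [2.0, 4.0, 6.0, 8.0]
above_three = list(filter(lambda r: r > 3, readings))
doubled = list(map(lambda r: r * 2, above_three))
total_doubled = reduce(lambda acc, r: acc + r, doubled, 0.0)

temperature = Measurement(name="temperature", readings=readings)
print(temperature.average())
print(total_doubled)
```

Saving this as a file and importing `mean` from another script is all it takes to turn it into a module.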
2. Pandas
Pandas is still the number one Python library for data manipulation, processing and analysis. In 2021 this is still one of the most vital skills to have as a data scientist.
Data is at the very heart of any data science project and Pandas is the tool that will enable you to extract, clean, process and derive insights from it. Most machine learning libraries also generally take Pandas DataFrames as a standard input these days.
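As a rough illustration of that clean, process and summarise workflow, the sketch below uses a tiny made-up dataset. The column names and values are assumptions chosen purely for the example.

```python
import pandas as pd

# Hypothetical raw data with a missing value and inconsistent casing.
raw = pd.DataFrame(
    {
        "customer": ["alice", "Bob", "alice", "bob", "Cara"],
        "amount": [10.0, 20.0, None, 5.0, 7.5],
    }
)

# Clean: normalise the customer names and drop rows with no amount.
clean = raw.assign(customer=raw["customer"].str.lower()).dropna(subset=["amount"])

# Process: total spend per customer, largest first.
spend = clean.groupby("customer")["amount"].sum().sort_values(ascending=False)

print(spend)
```

The same few calls (`assign`, `dropna`, `groupby`, `sort_values`) cover a surprising share of day-to-day data preparation.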
3. SQL and NoSQL
SQL has been around since the 1970s, but it remains one of the most vital and sought-after skills for data scientists. The vast majority of businesses use relational databases as their analytical data stores, and as a data scientist, SQL is the tool that will deliver you this data.
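Here is a small sketch of the kind of query this involves, run against an in-memory SQLite database through Python's built-in `sqlite3` module so that it needs no setup. The `orders` table and its contents are invented for the example; a real analytical store would be a different database, but the SQL itself carries over largely unchanged.

```python
import sqlite3

# An in-memory database stands in for a real analytical data store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 10.0), ("bob", 20.0), ("bob", 5.0), ("cara", 7.5)],
)

# Aggregate spend per customer and keep only those who spent more than 8.
query = """
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    HAVING SUM(amount) > 8
    ORDER BY total DESC
"""
rows = conn.execute(query).fetchall()
conn.close()

print(rows)
```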
NoSQL (“not only SQL”) databases don’t store data as relational tables; instead, data is stored as key-value pairs, wide columns or graphs. Example NoSQL databases include Google Cloud Bigtable and Amazon DynamoDB.
As the volume of data collected by companies increases and unstructured data becomes more regularly used in machine learning models, organisations are turning to NoSQL databases, either as a complement to, or as an alternative to, the traditional data warehouse. This trend is likely to continue into 2021, and as a data scientist it is important to gain at least a basic understanding of how to interact with data in this form.
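To give a feel for the three storage models mentioned above, the sketch below mimics each one with plain Python dictionaries. This is only an illustration of the data shapes, not the API of Bigtable, DynamoDB or any other real database, and all keys and values are made up.

```python
# Key-value: one opaque value stored and fetched by its key.
key_value_store = {"user:42": {"name": "Alice", "plan": "pro"}}

# Wide-column: each row key maps to its own set of columns,
# so rows do not need to share the same columns.
wide_column_store = {
    "user:42": {"profile:name": "Alice", "events:last_login": "2020-12-01"},
    "user:43": {"profile:name": "Bob"},
}

# Graph: nodes and edges, represented here as an adjacency list.
graph_store = {"alice": ["bob"], "bob": ["cara"], "cara": []}


def friends_of_friends(graph, person):
    """Return people exactly two hops away, excluding direct friends."""
    direct = set(graph[person])
    two_hops = {f2 for f1 in direct for f2 in graph[f1]}
    return two_hops - direct - {person}


print(key_value_store["user:42"]["plan"])
print(sorted(wide_column_store["user:43"]))
print(friends_of_friends(graph_store, "alice"))
```

Notice that none of these shapes requires a fixed schema up front, which is a large part of their appeal for unstructured data.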
4. Cloud
According to a report from O’Reilly in January this year, titled ‘Cloud adoption in 2020’, more than 88% of organisations were using some form of cloud infrastructure. The impact of Covid-19 is likely to have further accelerated this adoption.
“At first glance, cloud usage seems overwhelming. More than 88% of respondents use cloud in one form or another. Most respondent organizations also expect to grow their usage over the next 12 months.”, Cloud Adoption 2020, By Roger Magoulas and Steve Swoyer.
The use of cloud in other areas of a business usually goes hand in hand with cloud-based solutions for data storage, analytics and machine learning. The major cloud providers, such as Google Cloud Platform, Amazon Web Services and Microsoft Azure, are building out tooling for training, deploying and serving machine learning models at a rapid pace.
As a data scientist working in 2021 and beyond, it is very likely that you will be working with data housed in a cloud-based database such as Google BigQuery and developing cloud-based machine learning models. Experience and skills in this area are likely to be in high demand.
5. Airflow
Apache Airflow, an open source workflow management tool, is rapidly being adopted by many businesses for the management of ETL processes and machine learning pipelines. Many large tech companies, such as Google and Slack, are using it, and Google even built its Cloud Composer tool on top of the project.
I am noticing Airflow being mentioned more and more often as a desirable skill for data scientists in job adverts. As mentioned at the beginning of this article, I believe it will become more important for data scientists to be able to build and manage their own data pipelines for analytics and machine learning. The growing popularity of Airflow is likely to continue, at least in the short term, and as an open source tool it is definitely something that every budding data scientist should learn.
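The core idea behind a workflow manager is that a pipeline is a set of tasks with dependencies, run in an order that respects them. The sketch below illustrates only that idea, using Python's standard `graphlib` module (Python 3.9+). It deliberately does not use Airflow's own API, and the `extract`, `transform` and `load` functions are hypothetical stand-ins for real ETL steps.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def extract():
    """Pretend to pull raw rows from a source system."""
    return [3, 1, 2]


def transform(rows):
    """Pretend to clean the rows; here we simply sort them."""
    return sorted(rows)


def load(rows, target):
    """Pretend to write the rows to a warehouse table."""
    target.extend(rows)


# Each task maps to the set of tasks that must finish before it starts.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}
run_order = list(TopologicalSorter(dependencies).static_order())

warehouse = []
results = {}
for task in run_order:
    if task == "extract":
        results[task] = extract()
    elif task == "transform":
        results[task] = transform(results["extract"])
    elif task == "load":
        load(results["transform"], warehouse)

print(run_order)
print(warehouse)
```

A tool like Airflow adds scheduling, retries, logging and monitoring on top of this basic dependency ordering.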
6. Software engineering
Data science code is traditionally messy, not always well tested and lacking in adherence to styling conventions. This is fine for initial data exploration and quick analysis, but when it comes to putting machine learning models into production, a data scientist needs a good understanding of software engineering principles.
If you are planning to work as a data scientist, it is likely that you will either be putting models into production yourself or at least be heavily involved in the process. It is therefore essential to cover the following skills in any learning that you undertake:
Code conventions such as the PEP 8 Python style guide.
Version control, e.g. Git with GitHub.
Dependencies and virtual environments.
Containers, e.g. Docker.
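As a small taste of what production-minded code looks like, here is a sketch of a function written to PEP 8 conventions (snake_case names, a docstring, type hints) together with a unit test. The function `normalise_scores` is a hypothetical example, and the type hint syntax assumes Python 3.9 or later.

```python
import unittest


def normalise_scores(scores: list[float]) -> list[float]:
    """Scale scores linearly to the range [0, 1]."""
    lowest, highest = min(scores), max(scores)
    if highest == lowest:
        # Avoid dividing by zero when every score is identical.
        return [0.0 for _ in scores]
    return [(score - lowest) / (highest - lowest) for score in scores]


class NormaliseScoresTest(unittest.TestCase):
    def test_scales_to_unit_range(self):
        self.assertEqual(normalise_scores([2.0, 4.0, 6.0]), [0.0, 0.5, 1.0])

    def test_constant_input_maps_to_zero(self):
        self.assertEqual(normalise_scores([3.0, 3.0]), [0.0, 0.0])
```

Saved in a file, the tests can be run with `python -m unittest`, and the whole thing can be committed to version control like any other code.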
In this article, I wanted to highlight some of the key trends emerging in terms of the skills required for data scientists. These insights have been gleaned from reviewing current data science job adverts, my own experience working as a data scientist and reading articles covering future trends in the field.
This is not meant as an exhaustive list; there are certainly a lot more skills and experience needed to become a successful data scientist. However, in this post, I wanted to cover some of the most important skills that are very likely to be required in the coming year.
If you are studying to be a data scientist and want a more comprehensive list of skills to learn, I wrote a series of articles giving a complete roadmap for learning. They are linked below.
Programming Skills, A Complete Roadmap for Learning Data Science — Part 1
A complete guide to programming skills for data science, includes links to free learning resources.
Data Analysis, A Complete Roadmap for Learning Data Science — Part 2
Part two of my complete roadmap for learning data science takes a look at the important skills needed for data…
Maths and Statistics, A Complete Roadmap for Learning Data Science — Part 3
Key concepts in maths and statistics for data science, and where to learn them.
Thanks for reading!
I send out a monthly newsletter. If you would like to join, please sign up via this link. Looking forward to being part of your learning journey!