The goal of this analysis is to explore the skills most sought after by Ukrainian companies that are looking for Data Scientists.
The results of this exploration will be used to help restructure the curriculum of a master's program so that students gain the skills required by the market.
Scraping data from djinni.co
I chose the website djinni.co as the data source because it provides the most relevant Ukrainian job postings and can be scraped through a simple-to-use JSON REST API.
Pulling preloaded data from file
These models are required for corpus analysis
Let's take a look at a single job description to determine the further steps required to retrieve useful insights.
As we can see, each posting is full of "dirty" tokens: symbols, stop-words like "in" and "of", and Ukrainian descriptions. None of these can be used to determine key skills, so we have to remove them from the text.
First step: remove common English stop-words.
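A minimal sketch of this step, assuming a plain whitespace tokenizer and a tiny hand-written stop-word list. Both are illustrative assumptions: a real run would more likely use a full stop-word list from an NLP library, and the sample description is hypothetical.

```python
# Illustrative subset of English stop-words
# (assumption: a real analysis would use a much fuller list).
STOP_WORDS = {"a", "an", "and", "the", "in", "of", "to", "with", "for", "is", "are", "we"}


def remove_stop_words(text):
    """Lowercase the text, split it on whitespace and drop stop-words."""
    tokens = text.lower().split()
    return [token for token in tokens if token not in STOP_WORDS]


# Hypothetical job description used only for demonstration.
sample = "We are looking for a developer with knowledge of Swift and Objective-C"
print(remove_stop_words(sample))
```

Lowercasing before the lookup keeps "The" and "the" from being treated as different tokens.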
Great, now we can see more useful tokens. The next step is to clean up punctuation symbols:
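One possible way to do this, sketched with the standard library only: strip punctuation from the edges of each token and drop tokens that end up empty. The helper name and the sample tokens are hypothetical.

```python
import string


def strip_punctuation(tokens):
    """Strip punctuation from both ends of each token and drop empty results."""
    cleaned = (token.strip(string.punctuation) for token in tokens)
    return [token for token in cleaned if token]


# Hypothetical tokens used only for demonstration.
tokens = ["swift,", "(objective-c)", "-", "experience:", "ios."]
print(strip_punctuation(tokens))
```

Stripping only the edges keeps inner hyphens, so "objective-c" survives. The trade-off is that a token like "c++" is reduced to "c", so skill names that end in symbols may need special handling.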
Now we can see some useful tokens that represent skills, like "swift" or "objective-c".
But some words are still useless. Let's try to classify them by part of speech; maybe that could filter some of them out.
Even though we would like to filter out adjectives, some of them turned out to be useful and describe required skills, so we cannot use POS classification to shrink our dataset.
Nevertheless, it looks like some words, such as "development", "learning" and "design", would make more sense if they were paired with other words from the job description. Let's try to build bi-grams: word tuples that consist of pairs of consecutive words.
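As a rough illustration, bi-grams can be built by zipping a token list with itself shifted by one position. The function name and the sample tokens are hypothetical.

```python
def build_bigrams(tokens):
    """Return pairs of adjacent tokens, in order of appearance."""
    return list(zip(tokens, tokens[1:]))


# Hypothetical cleaned tokens used only for demonstration.
tokens = ["machine", "learning", "communication", "skills"]
print(build_bigrams(tokens))
```

A list of n tokens yields n - 1 pairs, including pairs such as ("learning", "communication") that merely happen to be adjacent, which is why the resulting list still needs filtering.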
This pairwise analysis produces a big list of word pairs, many of which would be nice to treat as a single skill/token.
It also turns out that many soft skills are well represented as word pairs, like "'communication', 'skills'", "'hands-on', 'experience'", "'spoken', 'english'".
Unfortunately, there are also lots of job perks that are not relevant to our analysis, so it would be nice to filter them out.
Final data processing
Below we apply the whole cleaning and tokenization process to the entire dataset, so that we can get insights into the most frequently mentioned skills.
To reduce the impact of the same word being mentioned several times in a single job posting, the final bag of tokens is extended only by the set of unique tokens from each entry.
Once we have a cleaned-up set of tokens, we can group them to count the frequency of each one.
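A small sketch of the counting step using `collections.Counter`. It assumes the bag holds each token at most once per posting, so a count divided by the number of postings gives the share of postings mentioning that token. The helper name, the sample bag and the posting count are hypothetical.

```python
from collections import Counter


def posting_share(frequencies, token, n_postings):
    """Share of postings mentioning a token, given per-posting deduplicated counts."""
    return frequencies[token] / n_postings


# Hypothetical bag built from 4 postings, used only for demonstration.
bag = ["python", "sql", "python", "pytorch", "python", "sql"]
n_postings = 4

frequencies = Counter(bag)
print(frequencies.most_common(2))
print(posting_share(frequencies, "python", n_postings))
```

`Counter` returns 0 for tokens it has never seen, so asking for the share of an unmentioned skill does not raise an error.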
One of the first questions asked by a student who seeks to pursue a Data Scientist career is which programming language to learn.
The distribution below shows that Python is the clear winner, with triple the result of any other language. It is quite interesting that Java and C are also in some demand.
Despite the common opinion in the DS community, PyTorch (mostly backed by academia) gets ahead of TensorFlow (backed by enterprise).
Also, even though pandas is rarely used in production, it looks like it is still required, probably for exploratory tasks.
As for the different directions of ML, NLP and CV get far ahead of fraud detection and deserve their own place as separate courses in the master's program.
More than 35% of job postings expect applicants not only to process data but also to be able to deploy their models themselves. Such topics should be touched on in the introduction to data science and covered in depth in a separate course.
A big chunk of job postings mention data warehousing and data pipelining as skills expected from applicants. Such technologies are widely used, but it is really hard to get hands-on experience with them. Solving this problem could be a good idea for an EdTech startup.
There are also lots of technologies and other technical skills expected by companies.
It's a bit harder to detect the soft skills required from a data scientist, but 30% of companies emphasize communication in their postings, which supports the idea of including an oratory class in the course curriculum.