Python Packages Commonly Used Together
I spoke about this with @avesunaden a while ago, and it got me pretty excited. Can we create a more interesting interface for exploring Python packages? Something similar to bundlephobia.com?
One weekend I decided to have a crack at it.
A short summary
1. Getting the data
I pulled package co-occurrences from public requirements.txt files on GitHub using BigQuery.
2. Building the recommender
I chose collaborative filtering with implicit feedback, optimized by ALS.
3. Serving the results
To render the results on package.wiki I loaded all of the recommendations into Firebase, where I previously put the rest of the package metadata. A simple Flask app then serves the website.
Details
1. Getting the data
Some time ago I created a BigQuery integration, so I can just re-use it now, and Deepnote suggests the code below to connect easily.
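Roughly, the connection plus the extraction query look like the sketch below. The google-cloud-bigquery client and the public github_repos dataset are real, but the query text itself, the regex for pulling package names out of each line, and the flat rating of 1 are illustrative rather than the exact query, with the result ending up in `deepnote-200602.python_packages.used_together`.

```python
from google.cloud import bigquery

# In Deepnote the integration injects credentials automatically; locally,
# point GOOGLE_APPLICATION_CREDENTIALS at a service-account key instead.
client = bigquery.Client(project="deepnote-200602")

# Sketch of the extraction: take every requirements.txt in the public GitHub
# dataset and emit one (file_id, package, rating) row per requirement line.
query = r"""
SELECT
  files.id AS file_id,
  LOWER(REGEXP_EXTRACT(line, r'^[A-Za-z0-9_.-]+')) AS package,
  1 AS rating
FROM `bigquery-public-data.github_repos.files` AS files
JOIN `bigquery-public-data.github_repos.contents` AS contents
  ON files.id = contents.id
CROSS JOIN UNNEST(SPLIT(contents.content, '\n')) AS line
WHERE files.path LIKE '%requirements.txt'
  AND REGEXP_EXTRACT(line, r'^[A-Za-z0-9_.-]+') IS NOT NULL
"""

df = client.query(query).to_dataframe()
```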
This query scans about 2.5 TB, so be careful: running it costs about $12.50. The result, however, is small, only a few dozen MB.
2. Building the recommender
First, transform the data into the matrix form the algorithm expects, so we pivot.
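A minimal sketch of the pivot, assuming the DataFrame from the query above with its file_id, package and rating columns:

```python
from scipy.sparse import csr_matrix

# Encode files and packages as integer codes so they can become row and
# column indices of a sparse matrix.
df["file_id"] = df["file_id"].astype("category")
df["package"] = df["package"].astype("category")

# One row per requirements.txt file, one column per package, with the
# implicit "rating" (1 = the package appears in that file) as the value.
matrix = csr_matrix(
    (df["rating"], (df["file_id"].cat.codes, df["package"].cat.codes))
)

# Keep the column-index -> package-name mapping around for later lookups.
packages = df["package"].cat.categories
```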
Then train it. I chose the implicit library because I've used it in the past, but there are a number of libraries that do the job.
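A sketch of the training step; 50 factors mirrors the BigQuery ML attempt below, and the other hyperparameters are just the library defaults, not tuned values:

```python
from implicit.als import AlternatingLeastSquares

model = AlternatingLeastSquares(factors=50, regularization=0.01, iterations=15)

# Depending on the implicit version, fit() expects the matrix either as
# items x users or users x items -- check the docs of the version you install.
model.fit(matrix)
```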
Initially, I tried to make BigQuery do it for me with its native matrix factorization model:
CREATE OR REPLACE MODEL
`deepnote-200602.python_packages.used_together_model`
OPTIONS(MODEL_TYPE = 'MATRIX_FACTORIZATION'
, FEEDBACK_TYPE = 'IMPLICIT'
, NUM_FACTORS = 50
, USER_COL = 'file_id'
, ITEM_COL = 'package'
, RATING_COL = 'rating')
AS SELECT
file_id,
package,
rating
FROM
`deepnote-200602.python_packages.used_together`
But that didn't work: BigQuery ML's matrix factorization models can only be trained with a slot reservation, not under on-demand pricing, so sadly Google has made it pretty hostile for a one-off experiment like this.
Define the predict function
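A sketch of what it can look like on top of implicit's similar_items; packages is the index-to-name mapping kept from the pivot step:

```python
def predict(package_name, n=10):
    """Return the n packages most often used together with package_name."""
    package_id = packages.get_loc(package_name)

    # implicit >= 0.5 returns two arrays (ids, scores); older versions return
    # a list of (id, score) pairs, so adjust the unpacking to your version.
    ids, scores = model.similar_items(package_id, N=n + 1)

    # Drop the package itself, which is always its own nearest neighbour.
    return [
        (packages[i], float(score))
        for i, score in zip(ids, scores)
        if i != package_id
    ][:n]
```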
Batch predict for all of the packages we have
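Which is just a loop over everything (a sketch; in practice you might restrict it to packages that show up in enough files):

```python
# Related packages for every package in the matrix, keyed by name.
recommendations = {name: predict(name, n=10) for name in packages}
```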
3. Serving the results
You can explore the results on Package Wiki. I plan to add some additional insights shortly.
It reads the information from Firestore; below is roughly how I loaded the data. I also add summaries of the related packages.*
*Firestore isn't relational and has no joins, so it's easier to just store a copy of the summaries. Postgres would also make sense, I guess.
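A minimal sketch of the load with the firebase_admin SDK; the packages collection, the used_together field, the summaries lookup and the key path are all illustrative names, not necessarily what package.wiki actually uses:

```python
import firebase_admin
from firebase_admin import credentials, firestore

cred = credentials.Certificate("service-account.json")  # illustrative path
firebase_admin.initialize_app(cred)
db = firestore.client()

# `summaries` is assumed to be a dict of package name -> short description,
# built elsewhere from the existing package metadata.
for name, related in recommendations.items():
    db.collection("packages").document(name).set(
        {
            "used_together": [
                # Denormalised copy of each related package's summary,
                # since Firestore has no joins (see the footnote above).
                {"name": rel, "score": score, "summary": summaries.get(rel, "")}
                for rel, score in related
            ]
        },
        merge=True,
    )
```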
Evaluation - explore the results
I didn't split my dataset and didn't check any error metrics because I don't care. I just wanted to explore.
My notes and ideas so far:
- It's interesting that out of over 250,000 packages, fewer than 20,000 make it into a public `requirements.txt` on GitHub.
- numpy has some pretty weird packages recommended; I expected to know more of them, and to see things like scipy or pandas.
- Often people create requirements by running `pip freeze > requirements.txt`, so the recommendations are full of dependencies. I might try filtering those out next time.
Did you find anything interesting? Shoot me a message.