Package Wiki  

by Robert Lacok, Oct 26, 2020
  1. Python Packages Commonly Used Together
    1. A short summary
      1. Getting the data
      2. Building the recommender
      3. Serving the results
    2. Details
      1. Getting the data
      2. Building the recommender
      3. Serving the results
    3. Evaluation - explore the results

Python Packages Commonly Used Together

I spoke about this with @avesunaden a while ago, and it got me pretty excited: could we create a more interesting interface for exploring Python packages, similar to bundlephobia.com?

One weekend I decided to have a crack at it.

A short summary

1. Getting the data

We use two data sources:

  • The public GitHub BigQuery dataset, from which we fetch and parse package names out of all public requirements.txt files (a sketch of the parsing rule follows this list)
  • The Python Software Foundation dataset, from which we fetch some metadata about each package
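To make "parse package names" concrete: the big query in the details section essentially applies the following per-line rule to every requirements.txt. This Python sketch is my paraphrase of the SQL, not code from the original notebook.

def parse_requirement_line(line):
    # mirror the SQL: skip comment lines, then strip version pins
    # like 'flask==1.1.2' or 'flask>=1.0'
    if line.startswith('#'):
        return None
    return line.split('==')[0].split('>=')[0]

# parse_requirement_line('pandas>=1.0.3')  -> 'pandas'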

2. Building the recommender

I chose collaborative filtering with implicit feedback, optimized by ALS.

3. Serving the results

To render the results on package.wiki, I loaded all of the recommendations into Firestore, where I'd previously put the rest of the package metadata. A simple Flask app then serves the website.

Details

1. Getting the data

Some time ago I created a BigQuery integration, so I can just re-use it now; Deepnote suggests the code below to connect.

import json
import os
from google.oauth2 import service_account
from google.cloud import bigquery

bq_credentials = service_account.Credentials.from_service_account_info(
    json.loads(os.environ['BIGQUERY_DEEPNOTE_SERVICE_ACCOUNT']))
bq_client = bigquery.Client(credentials=bq_credentials,
                            project=bq_credentials.project_id)

This query scans about 2.5 TB, so be careful: it costs about $12.50 to run (see the dry-run sketch after the query). The result itself is pretty small, though, only a few dozen MB.

query = """ SELECT package, file_id, ANY_VALUE(rating) AS rating FROM ( SELECT name FROM the-psf.pypi.distribution_metadata GROUP BY name) m LEFT JOIN ( SELECT package, file_id, 1 AS rating FROM ( SELECT f.id AS file_id, ARRAY( SELECT SPLIT(SPLIT(row_, '==')[ OFFSET (0)], '>=')[ OFFSET (0)] FROM UNNEST(SPLIT( c.content, '\n')) AS row_ WHERE row_ NOT LIKE "#%" GROUP BY row_) AS requirements FROM ( SELECT id, ANY_VALUE(path) AS path FROM `bigquery-public-data.github_repos.files` GROUP BY id) f LEFT JOIN `bigquery-public-data.github_repos.contents` c ON f.id = c.id AND f.path LIKE "%requirements.txt" AND c.content IS NOT NULL), UNNEST(requirements) AS package) nested ON m.name = nested.package WHERE package IS NOT NULL GROUP BY package, file_id """ query_job = bq_client.query(query) df = query_job.to_dataframe() df

2. Building the recommender

Transform the data into the matrix form the algorithm expects, i.e. pivot it.

# This is kind of what we're going for:
# >>> df = df.pivot(index='file_id', columns='package', values='rating')
# but it creates a huge matrix full of zeroes,
# so we opt for a sparse representation
from scipy.sparse import csr_matrix
from pandas.api.types import CategoricalDtype

file_c = CategoricalDtype(sorted(df.file_id.unique()), ordered=False)
package_c = CategoricalDtype(sorted(df.package.unique()), ordered=False)

col = df.file_id.astype(file_c).cat.codes
row = df.package.astype(package_c).cat.codes
sparse_matrix = csr_matrix((df["rating"], (row, col)),
                           shape=(package_c.categories.size, file_c.categories.size))
sparse_matrix

Train it. I chose the implicit library because I'd used it in the past, but there are a number of libraries that do the job.
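For context, since the post doesn't spell it out: with implicit feedback, ALS factorizes the package-by-file matrix into latent vectors and roughly minimizes

$$\min_{x_*,\,y_*} \sum_{u,i} c_{ui}\,(p_{ui} - x_u^\top y_i)^2 + \lambda\Big(\sum_u \lVert x_u\rVert^2 + \sum_i \lVert y_i\rVert^2\Big)$$

where $p_{ui}$ is 1 when file $u$ lists package $i$ (0 otherwise), $c_{ui}$ is a confidence weight, and $\lambda$ is a regularizer. This is the Hu-Koren-Volinsky formulation that implicit implements; "similar packages" are then the items whose latent vectors $y_i$ lie closest together.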

import implicit

model = implicit.als.AlternatingLeastSquares(factors=50)
model.fit(sparse_matrix)

Initially I tried to make BigQuery do it for me with its native model:

CREATE OR REPLACE MODEL
`deepnote-200602.python_packages.used_together_model`
OPTIONS(MODEL_TYPE = 'MATRIX_FACTORIZATION'
  , FEEDBACK_TYPE = 'IMPLICIT' 
  , NUM_FACTORS = 50
  , USER_COL = 'file_id'
  , ITEM_COL = 'package'
  , RATING_COL = 'rating')
AS SELECT
  file_id,
  package,
  rating
FROM
  `deepnote-200602.python_packages.used_together` 

But that didn't work: sadly, Google has made this route pretty hostile (matrix factorization models in BigQuery ML require reserved slots rather than plain on-demand pricing).

Define the predict function:

import pandas as pd

def similar_items(package_name, n_items=20):
    index = package_c.categories.get_loc(package_name)
    items = pd.DataFrame(model.similar_items(index, n_items),
                         columns=['item_number', 'score'])
    # map matrix row numbers back to package names
    items['package'] = items.apply(
        lambda row: package_c.categories[int(row['item_number'])], axis=1)
    return list(items.package)
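A quick sanity check is to call it on any package that appears in the training data; the package name here is just illustrative, output omitted.

similar_items('flask')  # -> up to 20 package names, most similar first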

Batch predict for all of the packages we have:

import json

with open('recommendations.json', 'w') as f:
    for i, package in enumerate(package_c.categories):
        reco = {"package": package, "items": similar_items(package)}
        f.write(json.dumps(reco) + "\n")
        if i % 1000 == 0:
            print(f"On {i}th row")

3. Serving the results

You can explore the results on Package Wiki. I plan to add some additional insights shortly.

The website reads the information from Firestore; below is how I loaded the data. I also add summaries of the related packages.*

*Firestore isn't relational and has no joins, so it's easier to just store a copy of the summaries with each package. Postgres would also make sense, I guess.

query = """ SELECT name, ANY_VALUE(summary) AS summary FROM `the-psf.pypi.distribution_metadata` GROUP BY name """ query_job = bq_client.query(query) summaries = query_job.to_dataframe() summaries_dict = summaries.set_index('name').to_dict('index')
from google.cloud import firestore
from google.api_core.exceptions import NotFound

# I'm going to sneakily re-use the BQ credentials, because I can
firestore_client = firestore.Client(credentials=bq_credentials,
                                    project=bq_credentials.project_id)
collection = firestore_client.collection("python_packages_v2")

with open('recommendations.json', 'r') as f:
    for i, line in enumerate(f):
        recos = json.loads(line.strip())
        firestore_id = recos["package"]
        recos_with_summaries = [
            {"name": package,
             "summary": summaries_dict.get(package, {}).get('summary', '')}
            for package in recos["items"]
        ]
        data = {"recommendations": recos_with_summaries}
        try:
            collection.document(firestore_id).update(data)
        except NotFound:
            collection.document(firestore_id).set(data)
        if i % 1000 == 0:
            print(f"On {i}th row")
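The Flask app itself isn't in the notebook, but a minimal version might look like the sketch below. The route and template names are my assumptions, not the actual package.wiki code.

from flask import Flask, render_template
from google.cloud import firestore

app = Flask(__name__)
# assumes default application credentials; the real app may authenticate differently
db = firestore.Client()
packages = db.collection("python_packages_v2")

@app.route("/<package_name>")
def package_page(package_name):
    # fetch the precomputed recommendations for this package, if any
    doc = packages.document(package_name).get()
    data = doc.to_dict() or {}
    return render_template("package.html",  # hypothetical template
                           package=package_name,
                           recommendations=data.get("recommendations", []))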

Evaluation - explore the results

I didn't split my dataset and didn't check any error metrics because I don't care. I just wanted to explore.

My notes and ideas so far:

  • It's interesting that out of over 250,000 packages, fewer than 20,000 make it into public requirements.txt files on GitHub.
  • numpy has some pretty weird packages recommended; I expected to recognize more of them, and to see things like scipy or pandas.
  • People often create their requirements by running pip freeze > requirements.txt, so the recommendations are full of transitive dependencies. I might try filtering those out next time; a sketch of the idea follows this list.
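If I did filter them out, it could work roughly like this: given a mapping from each package to its dependencies (e.g. built from requires_dist in the PyPI metadata), drop anything that another package in the same file already pulls in. A hypothetical sketch, not part of the original post:

def drop_pip_freeze_noise(file_packages, deps_of):
    """Keep only packages that no other package in the same file depends on.

    file_packages: set of package names from one requirements.txt
    deps_of: dict mapping a package to the set of its dependencies
    """
    pulled_in = set()
    for pkg in file_packages:
        pulled_in |= deps_of.get(pkg, set())
    return {pkg for pkg in file_packages if pkg not in pulled_in}

# A pip-freeze style file listing flask plus its own dependencies
# collapses back to just flask:
print(drop_pip_freeze_noise(
    {'flask', 'jinja2', 'werkzeug', 'click'},
    {'flask': {'jinja2', 'werkzeug', 'click'}}))  # -> {'flask'}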

Did you find anything interesting? Shoot me a message.
