GitHub on Hacker News trends analysis

Analysis of GitHub-related topics and how they perform on Hacker News

```python
from __future__ import print_function

import pandas as pd
```
## Input parameters
```python
project_id = "your-project-id"
process_date = "2018-12-01"
process_date_nodash = "20181201"
```
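Keeping the two date strings in sync by hand is error-prone. As a small convenience (not in the original notebook), `process_date_nodash` can be derived from `process_date`:

```python
from datetime import datetime

process_date = "2018-12-01"
# Derive the compact form used in table names like `githubarchive.day.20181201`
process_date_nodash = datetime.strptime(process_date, "%Y-%m-%d").strftime("%Y%m%d")
```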
## Exploratory Data Analysis
## GitHub activity data

- [Data](https://bigquery.cloud.google.com/table/githubarchive:day.20181230)
- [More info](https://blog.github.com/2017-01-19-github-data-ready-for-you-to-explore-with-bigquery/)
### Different event types in GitHub activity

See [Event Types & Payloads](https://developer.github.com/v3/activity/events/types/) for an explanation of each type.
query = """ SELECT type, COUNT(*) AS cnt FROM `githubarchive.day.{0}` GROUP BY 1 ORDER BY 2 DESC """.format(process_date_nodash) print (query) df = pd.read_gbq(query, project_id=project_id, dialect='standard') df.head(20)
### Top 10 repos with the most comments in their issues

- __IssueCommentEvent__: triggered when an issue comment is created, edited, or deleted.
query = """ SELECT repo.name, COUNT(*) AS cnt FROM `githubarchive.day.{0}` WHERE type IN ( 'IssueCommentEvent') GROUP BY 1 ORDER BY 2 DESC LIMIT 10 """.format(process_date_nodash) print (query) df = pd.read_gbq(query, project_id=project_id, dialect='standard') df.head(20)
### Top 10 repos by star and fork events
query = """ SELECT repo.name, SUM(IF(type='WatchEvent', 1, NULL)) AS stars, SUM(IF(type='ForkEvent', 1, NULL)) AS forks, COUNT(*) AS cnt FROM `githubarchive.day.{0}` WHERE type IN ('WatchEvent','ForkEvent') GROUP BY 1 ORDER BY 2 DESC LIMIT 10 """.format(process_date_nodash) print (query) df = pd.read_gbq(query, project_id=project_id, dialect='standard') df.head(20)
## Hacker News data

- [Data](https://bigquery.cloud.google.com/table/bigquery-public-data:hacker_news.full)
- [More info](https://medium.com/@hoffa/hacker-news-on-bigquery-now-with-daily-updates-so-what-are-the-top-domains-963d3c68b2e2)
### Top domains shared on Hacker News

- Domains with higher scores are more likely to make it to the front page.
- __nytimes__ has the highest average score.
query = """ SELECT REGEXP_EXTRACT(url, '//([^/]*)/?') as domain, AVG(score) as avg_score, COUNT(*) AS cnt FROM `bigquery-public-data.hacker_news.full` WHERE url!='' AND EXTRACT(DATE FROM timestamp)="{0}" GROUP BY 1 ORDER BY 3 DESC LIMIT 10 """.format(process_date) print (query) df = pd.read_gbq(query, project_id=project_id, dialect='standard') df.head(20)
### What domains have the best chance of getting more than 40 upvotes?

Hacker News clearly favors content hosted on sites such as github.com and nytimes.com.
query = """ SELECT REGEXP_EXTRACT(url, '//([^/]*)/?') as domain, COUNTIF(score>40) as score_gt_40, COUNT(*) AS cnt FROM `bigquery-public-data.hacker_news.full` WHERE url!='' AND EXTRACT(DATE FROM timestamp)="{0}" GROUP BY 1 ORDER BY 2 DESC LIMIT 10 """.format(process_date) print (query) df = pd.read_gbq(query, project_id=project_id, dialect='standard') df.head(20)
### Top 10 Hacker News stories from GitHub by highest score
query = """ SELECT `by` AS submitter, id as story_id, REGEXP_EXTRACT(url, "(https?://github.com/[^/]*/[^/#?]*)") as url, SUM(score) as score FROM `bigquery-public-data.hacker_news.full` WHERE type = 'story' AND EXTRACT(DATE FROM timestamp)='{0}' AND url LIKE '%https://github.com%' AND url NOT LIKE '%github.com/blog/%' GROUP BY submitter, story_id, url ORDER BY score DESC """.format(process_date) print (query) df = pd.read_gbq(query, project_id=project_id, dialect='standard') df.head(10)
## Example final table: GitHub on Hacker News trends for 2018-12-01
query = """ WITH github_activity AS ( SELECT repo.name as repo, CONCAT('https://github.com/', repo.name) as url, SUM(IF(type='WatchEvent', 1, NULL)) AS stars, SUM(IF(type='ForkEvent', 1, NULL)) AS forks, COUNT(*) AS cnt FROM `githubarchive.day.{0}` WHERE type IN ('WatchEvent','ForkEvent') GROUP BY 1,2 ), hacker_news AS ( SELECT EXTRACT(DATE FROM timestamp) as date, `by` AS submitter, id as story_id, REGEXP_EXTRACT(url, "(https?://github.com/[^/]*/[^/#?]*)") as url, SUM(score) as score FROM `bigquery-public-data.hacker_news.full` WHERE type = 'story' AND EXTRACT(DATE FROM timestamp)='{1}' AND url LIKE '%https://github.com%' AND url NOT LIKE '%github.com/blog/%' GROUP BY 1,2,3,4 ) SELECT a.date as date, a.url as github_url, b.repo as github_repo, a.score as hn_score, a.story_id as hn_story_id, b.stars as stars, b.forks as forks FROM hacker_news as a LEFT JOIN github_activity as b ON a.url=b.url ORDER BY hn_score DESC LIMIT 10 """.format(process_date_nodash, process_date) print (query) df = pd.read_gbq(query, project_id=project_id, dialect='standard') df.head(10)
## Python PyPI stats

The Python Software Foundation provides the raw logs of Python installation activity.

- [Data](https://bigquery.cloud.google.com/table/the-psf:pypi.downloads20181230)
- [More info](https://packaging.python.org/guides/analyzing-pypi-package-downloads/)

__Challenge__: find the associated GitHub stars, fork events, and Hacker News stories for the most-downloaded Python packages from pip.
### Top 10 most downloaded packages from pip
query = """ SELECT TIMESTAMP_TRUNC(timestamp, DAY) as day, file.project as project, COUNT(*) as downloads FROM `the-psf.pypi.downloads{0}` WHERE details.installer.name = 'pip' GROUP BY 1,2 ORDER BY 3 desc """.format(process_date_nodash) print (query) df = pd.read_gbq(query, project_id=project_id, dialect='standard') df.head(10)
### What is the number of stars and fork events for botocore?
query = """ SELECT repo.name, SUM(IF(type='WatchEvent', 1, NULL)) AS stars, SUM(IF(type='ForkEvent', 1, NULL)) AS forks, COUNT(*) AS cnt FROM `githubarchive.day.{0}` WHERE type IN ('WatchEvent','ForkEvent') AND repo.name LIKE "%botocore%" GROUP BY 1 ORDER BY 2 DESC LIMIT 10 """.format(process_date_nodash) print (query) df = pd.read_gbq(query, project_id=project_id, dialect='standard') df.head(20)
## Resources

- [GitHub data, ready for you to explore with BigQuery](https://blog.github.com/2017-01-19-github-data-ready-for-you-to-explore-with-bigquery/)
- [Hacker News on BigQuery](https://medium.com/@hoffa/hacker-news-on-bigquery-now-with-daily-updates-so-what-are-the-top-domains-963d3c68b2e2)
- [Analyzing PyPI package downloads](https://packaging.python.org/guides/analyzing-pypi-package-downloads/)


To be continued...