Thesis Work Roadmap

  1. Select list of packages (Python libraries)

    a. Run this query on BQ to obtain monthly pip Python library download data from Google

    SELECT
      file.project as `project`,
      COUNT(*) AS num_downloads,
      DATE_TRUNC(DATE(timestamp), MONTH) AS `month`
    FROM `bigquery-public-data.pypi.file_downloads`
    WHERE DATE(timestamp)
      BETWEEN DATE_TRUNC(DATE_SUB(CURRENT_DATE(), INTERVAL 180 MONTH), MONTH)
        AND CURRENT_DATE()
    GROUP BY `month`, `project`
    ORDER BY `month` DESC
    

    b. Download the result of the above query as a csv and move it to data/queries/package_downloads.csv

    c. Run output_analysis.ipynb to produce our selected list of Python packages, data/inputs/packages_filtered.csv
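
    The selection logic, as a minimal sketch — the column names follow the query above, but the download threshold is an illustrative assumption, not the notebook's actual cutoff:

      # Hypothetical sketch of the package-selection step. Column names
      # ("project", "num_downloads") come from the BQ query above; the
      # 1M-download cutoff is an assumption for illustration.
      import pandas as pd

      downloads = pd.read_csv("data/queries/package_downloads.csv")
      totals = downloads.groupby("project")["num_downloads"].sum()
      selected = totals[totals >= 1_000_000].index.to_series(name="project")
      selected.to_csv("data/inputs/packages_filtered.csv", index=False)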

    d. Run pypi_json.sh to download json files for pip packages into data/pip_json
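
    pypi_json.sh presumably hits PyPI's JSON metadata endpoint (https://pypi.org/pypi/<package>/json); a minimal Python equivalent of a single fetch, with the output layout assumed:

      # Sketch: fetch one package's PyPI metadata and save it under data/pip_json/.
      import json, pathlib, requests

      def fetch_pypi_json(package: str) -> None:
          resp = requests.get(f"https://pypi.org/pypi/{package}/json", timeout=30)
          resp.raise_for_status()
          path = pathlib.Path("data/pip_json") / f"{package}.json"
          path.write_text(json.dumps(resp.json()))

      fetch_pypi_json("numpy")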

    e. Run link_packages.ipynb to obtain the GitHub information associated with each pip package, stored in data/inputs/package_repos.csv
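
    The linking step amounts to scanning each package's PyPI metadata for a github.com URL. A hedged sketch — which metadata fields the notebook actually searches is an assumption:

      # Sketch: extract a GitHub "owner/repo" from PyPI metadata by scanning
      # info.home_page and info.project_urls for github.com links.
      import json, pathlib, re

      def github_repo(pypi_json_path: str) -> str | None:
          info = json.loads(pathlib.Path(pypi_json_path).read_text())["info"]
          urls = [info.get("home_page") or ""]
          urls += list((info.get("project_urls") or {}).values())
          for url in urls:
              match = re.search(r"github\.com/([\w.-]+/[\w.-]+)", url or "")
              if match:
                  return match.group(1)
          return None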

  2. Obtain GitHub data. I split up the queries at 2018/09: prior to 2018/09, pip download data either doesn't exist (prior to 2016) or cannot be compared to data post 2018/09

    a. Import data/inputs/package_repos.csv as github_pip in your project

    b. GitHub data can be obtained using the following query. Note that thesis-402503 is my own personal project id; you'll have to change it to your own

    SELECT *
      FROM `githubarchive.month.20*`
      WHERE (_TABLE_SUFFIX BETWEEN '1101' AND '2403') AND repo.name in 
        (SELECT repo FROM `thesis-402503.python_libraries.github_pip`)
    

    I then saved the result as a BQ table called filtered_table, partitioned it, and saved the partitioned version into a table called partitioned_filtered

    CREATE TABLE `thesis-402503.python_libraries.partitioned_filtered`
    PARTITION BY RANGE_BUCKET(export_id, GENERATE_ARRAY(0, 1000, 1))
    CLUSTER BY export_id
    AS (
      SELECT *, CAST(FLOOR(1000*RAND()) AS INT64) AS export_id
      FROM `thesis-402503.python_libraries.filtered_table_pre18`
    );
    

    c. Create a folder in a gcloud bucket (mine is the folder filtered_github_data_large in the gcloud bucket thesis-github) and export the data

    • Data from 2018/09 to 2023/08
      EXPORT DATA 
      OPTIONS (uri='gs://thesis-github/filtered_github_data_large/partitions*.json', format='JSON') AS
      (SELECT * 
      FROM `thesis-402503.python_libraries.filtered_table`)
      
    • Data from 2011/01 to 2018/08
      EXPORT DATA 
      OPTIONS (uri='gs://thesis-github/github_data_pre_18/github_data_pre18*.json', format='JSON')
      AS (SELECT * FROM `thesis-402503.python_libraries.partition_filtered_table_pre18`)
      

    d. Install gsutil/gcloud (follow the instructions at https://cloud.google.com/sdk/docs/install), then run the commands below to download the raw data to your local machine

    • Data from 2018/09 to 2023/08
      gsutil -m cp -r gs://thesis-github/filtered_github_data_large data/github_raw/
      
    • Data from 2011/01 to 2018/08
      gsutil -m cp -r gs://thesis-github/github_data_pre_18 data/github_raw/
      
  3. Clean downloaded results

    a. Transform raw JSONs into csvs. The cleaning scripts delete the raw json data and replace it with cleaned data in data/github_clean/filtered_github_data_large and data/github_clean/filtered_github_data_pre_18, respectively

    • bash clean_github_data.sh (run from the undergrad_thesis home directory) cleans all the data
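
    The flattening itself is straightforward because BQ exports newline-delimited JSON; a minimal sketch for one shard (the shard filename and the columns kept are illustrative assumptions, not what the scripts actually keep):

      # Sketch: read one newline-delimited JSON shard from the BQ export and
      # flatten it to csv. The fields kept here are illustrative.
      import pandas as pd

      events = pd.read_json(
          "data/github_raw/filtered_github_data_large/partitions000000000000.json",
          lines=True,
      )
      flat = pd.concat(
          [events[["type", "created_at"]],
           pd.json_normalize(events["repo"]).add_prefix("repo_")],
          axis=1,
      )
      flat.to_csv(
          "data/github_clean/filtered_github_data_large/partitions000000000000.csv",
          index=False,
      )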

    b. obtain_commit_data_pr gets commit data for PRs; commit data for pushes was already obtained in part a
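
    The PR case needs an extra round trip because push events embed their commits while PR events only reference them; GitHub's REST API exposes a per-PR commit listing. A sketch — the owner/repo/PR-number inputs and token handling are placeholders, not the script's actual interface:

      # Sketch: list the commits on one pull request via GitHub's REST API.
      import os, requests

      def pr_commits(owner: str, repo: str, number: int) -> list[dict]:
          resp = requests.get(
              f"https://api.github.com/repos/{owner}/{repo}/pulls/{number}/commits",
              headers={"Authorization": f"token {os.environ['GITHUB_TOKEN']}"},
              timeout=30,
          )
          resp.raise_for_status()
          return resp.json()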

    c. collect_commit_data.sh and collect_commit_data_pr.sh collect commit data for pushes and pull requests

    d. aggregating_push_data.sh and aggregating_pr_data.sh aggregate and clean push and pull request data

    e. match_committers_pr.py and match_committers_push.py link commit authors/committers to GitHub ids
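
    A minimal sketch of the matching idea in the match_committers scripts — joining on commit email is an assumption about their heuristic, and both input paths are hypothetical:

      # Sketch: join commit author emails to GitHub logins, flagging matches on
      # free-email domains (data/inputs/free_email_domains.txt) as weaker evidence.
      import pandas as pd

      commits = pd.read_csv("data/github_commits/push_commits.csv")  # hypothetical path
      logins = pd.read_csv("data/inputs/github_identities.csv")      # hypothetical path
      free = set(open("data/inputs/free_email_domains.txt").read().split())

      matched = commits.merge(logins, on="email", how="left")
      matched["free_email"] = matched["email"].str.split("@").str[-1].isin(free)
      matched.to_csv("data/github_commits/push_commits_matched.csv", index=False)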

Directory overview:

  • data/queries contains results from Google BigQuery
  • data/inputs contains data we use as inputs for Google BigQuery or other files
  • data/pip_json contains json data we downloaded in order to learn about pip packages
  • data/github_raw contains downloaded data from GitHub
  • data/github_clean contains data cleaned from data/github_raw
  • data/github_commits contains commit data cleaned from data/github_clean
  • data/package_downloads contains pip package downloads with version info at the daily level
  • repos is a temporary storage location we use when cloning GitHub repos to extract git commit information
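
The git-side extraction behind the repos folder presumably looks something like this sketch (the clone destination naming and log format are assumptions):

    # Sketch: clone a repo into repos/ and dump per-commit author/committer
    # metadata with a machine-readable `git log` format.
    import subprocess

    def commit_authors(repo: str) -> list[list[str]]:
        dest = f"repos/{repo.replace('/', '__')}"
        subprocess.run(
            ["git", "clone", "--quiet", f"https://github.com/{repo}.git", dest],
            check=True,
        )
        log = subprocess.run(
            ["git", "-C", dest, "log", "--pretty=format:%H|%an|%ae|%cn|%ce"],
            check=True, capture_output=True, text=True,
        )
        return [line.split("|") for line in log.stdout.splitlines()]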


undergrad_thesis/data/inputs/free_email_domains.txt is from https://gist.github.com/humphreybc/d17e9215530684d6817ebe197c94a76b

We also downloaded a languages file to determine which projects have Python as their dominant language.
