Make large-scale analysis of Python code several orders of magnitude quicker

🐌

Tom Forbes

$1,000 raised

$1,000 valuation

Project description

Python is one of the most used programming languages in the world, and the Python Package Index (PyPI) contains nearly half a million projects with over 14 terabytes of published code.

Exploring the contents of PyPI is currently impractical because of its total size, its API, and the complexity of Python packages. This makes it harder than it should be to:

Detect malicious packages
Quantitatively understand how the Python ecosystem is evolving
Detect credentials that have been accidentally published

I will develop a system that makes it possible to explore all Python code published on PyPI using consumer-grade hardware and an ordinary internet connection, in a matter of minutes rather than weeks. This will be achieved by using Git packfiles to compress huge quantities of PyPI code into manageable chunks (~14 TB down to ~100 GB), distributing them via GitHub or another service, and then providing tooling to explore and analyse the contained code. I will also use third-party secret scanning services to scan the contents for leaked credentials and notify the owners automatically.
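To illustrate one reason Git storage suits this problem, here is a minimal sketch of Git's content addressing: a file's object ID depends only on its bytes, so a file that is unchanged between releases is stored once. This is not the project's actual tooling. The release names and file contents are hypothetical, and real packfiles also apply delta and zlib compression, which the sketch does not model.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the object ID Git assigns to a file with this content (SHA-1 repos)."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def unique_objects(releases: dict[str, dict[str, bytes]]) -> dict[str, bytes]:
    """Collapse every file from every release into a map of object ID -> content."""
    store: dict[str, bytes] = {}
    for files in releases.values():
        for content in files.values():
            # Identical content always hashes to the same ID, so it is kept once.
            store[git_blob_id(content)] = content
    return store


# Hypothetical releases: only one file changes between versions.
releases = {
    "example-1.0": {"pkg/__init__.py": b"", "pkg/core.py": b"def f():\n    return 1\n"},
    "example-1.1": {"pkg/__init__.py": b"", "pkg/core.py": b"def f():\n    return 2\n"},
    "example-1.2": {"pkg/__init__.py": b"", "pkg/core.py": b"def f():\n    return 2\n"},
}

total_files = sum(len(files) for files in releases.values())
store = unique_objects(releases)
print(f"{total_files} files stored as {len(store)} unique objects")
```

Because most files in a package do not change from one release to the next, this kind of deduplication would plausibly account for part of the size reduction estimated above.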

This won't change the world, cure malaria or save a life. But it will have a positive impact on the security of the internet (specifically for companies that publish code to PyPI) and hopefully on the evolution of one of the world's most popular programming languages.

What is your track record on similar projects?

I've built a system that trawls PyPI releases for AWS keys in real time, and I've been working on a proof of concept for this project. It's more than possible, and I've already found a worrying number of credentials leaked on PyPI.
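As a rough idea of what scanning source text for AWS keys can look like, here is a minimal sketch using a commonly used pattern for long-term AWS access key IDs (20 characters starting with "AKIA"). It is not the scanner described above. The function name and sample text are hypothetical, and the key shown is the placeholder from AWS documentation.

```python
import re

# Commonly used pattern: long-term AWS access key IDs start with "AKIA"
# followed by 16 uppercase letters or digits.
AWS_KEY_ID = re.compile(r"\b(AKIA[0-9A-Z]{16})\b")


def find_aws_key_ids(source: str) -> list[tuple[int, str]]:
    """Return (line number, key ID) pairs for every candidate key in the text."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in AWS_KEY_ID.finditer(line):
            hits.append((lineno, match.group(1)))
    return hits


# Hypothetical file content; it is only scanned as text, never executed.
sample = 'import boto3\nclient = boto3.client("s3", aws_access_key_id="AKIAIOSFODNN7EXAMPLE")\n'
hits = find_aws_key_ids(sample)
print(hits)
```

A pattern match only identifies a candidate. Confirming that a key is live needs a separate check, and that is outside the scope of this sketch.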

How will you spend your funding?

I need to rent some beefy server hardware to bootstrap the project and create the initial packfiles, and I need some compensation for the time it will take to work on this project.

🐌

Tom Forbes

almost 2 years ago

Final report

Description of subprojects and results, including major changes from the original proposal

Python code is mirrored to GitHub here: https://github.com/pypi-data

The auto-updating website is here: https://py-code.org/

Secrets are being reported to the PSF.

Spending breakdown

Domain name: $10

GitHub subscription for increased CI capacity: $50

Server costs for initial development: $300

🐌

Tom Forbes

almost 2 years ago

Progress update

What progress have you made since your last update?

The project is now complete! All PyPI code is mirrored to GitHub, and new uploads are automatically added as well.

🐌

Tom Forbes

over 2 years ago

Update: This project was pretty delayed due to unrelated personal issues, but I completed the first stage last week: https://github.com/pypi-data

All the Python code published up until the 8th of July is within this organisation. It's not live-updating yet, but the infrastructure is all there.

I'm working with https://www.gitguardian.com/ to scan the contents for credentials. After that I plan to produce some analysis of the contents, write documentation and finally publicise the project.

Thank you for your funding, Austin! It helped pay for some infrastructure to set up the project and begin scoping out a way to do this, and it gave me the motivation to continue.