Make large-scale analysis of Python code several orders of magnitude quicker

Project description

Python is one of the most used programming languages in the world, and the Python Package Index (PyPI) contains nearly half a million projects with over 14 terabytes of published code.

Exploring the contents of PyPI is currently impractical because of its total size, its API, and the complexity of Python packages. This makes it harder than it should be to:

Detect malicious packages
Quantitatively understand how the Python ecosystem is evolving
Detect credentials that have been accidentally published

I will develop a system that makes it possible to explore all Python code published on PyPI with consumer-grade hardware and an ordinary internet connection in a matter of minutes rather than weeks. This will be achieved by using Git packfiles to compress huge quantities of PyPI code into manageable chunks (from ~14 TB down to ~100 GB), distributing them via GitHub or another service, and then providing tooling to explore and analyse the contained code. I will use third-party secret scanning services to scan the contents for leaked credentials and notify the owners automatically.
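To give a feel for why storing many releases in Git can shrink them so much, here is a minimal sketch, not the project's actual tooling. It shows only one of the mechanisms involved: Git addresses file contents by hash, so a file that is unchanged between releases is stored once. Real packfiles also delta-compress similar objects, which this sketch does not attempt. The package contents and the `store_releases` helper are hypothetical.

```python
import hashlib
import zlib


def git_blob_id(data: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to a blob with this content."""
    header = b"blob " + str(len(data)).encode() + b"\0"
    return hashlib.sha1(header + data).hexdigest()


def store_releases(releases: list[dict[str, bytes]]) -> tuple[int, int]:
    """Return (raw_bytes, stored_bytes) for a list of releases.

    Each release maps file paths to file contents. Identical contents are
    stored once, keyed by blob ID, and each stored blob is zlib-compressed.
    """
    raw_bytes = 0
    store: dict[str, bytes] = {}
    for files in releases:
        for content in files.values():
            raw_bytes += len(content)
            object_id = git_blob_id(content)
            if object_id not in store:
                store[object_id] = zlib.compress(content)
    stored_bytes = sum(len(blob) for blob in store.values())
    return raw_bytes, stored_bytes


# Hypothetical package: three releases where only the version file changes.
core = b"def add(a, b):\n    return a + b\n" * 200
releases = [
    {
        "pkg/core.py": core,
        "pkg/version.py": f'__version__ = "1.0.{i}"\n'.encode(),
    }
    for i in range(3)
]

raw_bytes, stored_bytes = store_releases(releases)
print(f"raw: {raw_bytes} bytes, stored: {stored_bytes} bytes")
```

In this toy case the unchanged `pkg/core.py` is stored a single time regardless of how many releases contain it, which is the pattern that most real package histories follow.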

This won't change the world, cure malaria or save a life. But it will have a positive impact on the security of the internet (specifically for companies that publish code to PyPI) and, hopefully, on the evolution of one of the world's most popular programming languages.

What is your track record on similar projects?

I've built a system that trawls PyPI releases for AWS keys in real time, and I've been working on a proof of concept for this project. It's more than possible, and I've already found a worrying number of credentials leaked on PyPI.
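As an illustration of what scanning for AWS keys can look like, here is a minimal sketch, and not the system described above. It assumes the commonly documented shape of AWS access key IDs: 20 characters, starting with AKIA (long-term) or ASIA (temporary), followed by 16 uppercase letters or digits. The `find_aws_key_ids` helper is hypothetical, and the sample key is the placeholder used in AWS's own documentation.

```python
import re

# Assumed pattern for AWS access key IDs: AKIA/ASIA prefix + 16 characters.
AWS_KEY_ID_RE = re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b")


def find_aws_key_ids(source: str) -> list[tuple[int, str]]:
    """Return (line_number, key_id) pairs for every match in the text."""
    hits = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        for match in AWS_KEY_ID_RE.finditer(line):
            hits.append((line_number, match.group(0)))
    return hits


# The sample is plain text to scan; nothing here imports or calls boto3.
sample = (
    "import boto3\n"
    'client = boto3.client("s3", aws_access_key_id="AKIAIOSFODNN7EXAMPLE")\n'
)
hits = find_aws_key_ids(sample)
print(hits)
```

A pattern match like this only flags candidates. Confirming that a key is live, and finding its matching secret key, needs further checks that are out of scope for this sketch.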

How will you spend your funding?

I need to rent some beefy server hardware to bootstrap the project and create the initial packfiles, and I need some compensation for the time it will take to work on this project.