---
title: Scraping FRC teams' GitHub accounts to gather large amounts of data
description: There are a lot of teams...
date: 2019-07-06
tags: frc
aliases:
- /blog/2019/07/06/scrapingfrcgithub
- /blog/scrapingfrcgithub
---

I was curious about the most used languages for FRC, so I built a Python script to find out what they were.

## Some basic data

Before we get to the heavy work done by my script, let's start with some general data.

Thanks to the [TBA API](https://www.thebluealliance.com/apidocs/v3), I know that there are 6917 registered teams. 492 of them have registered at least one account on GitHub.

## How the script works

The script is split into steps:

- Get a list of every registered team
- Check for a GitHub account attached to every registered team
  - If a team has an account, it is added to the dataset
- Load each GitHub profile
  - If it is a private account, move on
  - Use regex to find all languages used
- Compile data and sort

### Getting a list of accounts

This is probably the simplest step in the whole process. I used the auto-generated [tbaapiv3client](https://github.com/TBA-API/tba-api-client-python) Python library's `get_teams_keys(key)` function, and kept incrementing `key` until I got an empty array. All returned data was then added together into one big list of team keys.

### Checking for a team's GitHub account

The [TBA API](https://www.thebluealliance.com/apidocs/v3) helpfully provides a `/api/v3/team//social_media` API endpoint that returns the GitHub username for any team you request (or nothing if they don't use GitHub).

A `for` loop over the list of every team key did the trick for finding accounts.

### Fetching language info

To remove the need for an OAuth login to use the script, GitHub data is retrieved using standard HTTPS requests instead of AJAX requests to the API. This gets around the API's tiny rate limit, but takes a bit longer to complete.

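As a rough illustration of the two TBA lookups described earlier (paging through team keys until an empty page comes back, then checking each team's social media entries), here is a minimal sketch. It is not the original script: it uses plain `urllib` rather than the tbaapiv3client library, and the helper names (`tba_get`, `collect_team_keys`, `github_username`, `find_github_accounts`) are made up for this example. It also assumes the `/teams/{page}/keys` and `/team/{team_key}/social_media` paths, the `X-TBA-Auth-Key` header, and that social media entries carry `type` and `foreign_key` fields with `"github-profile"` marking GitHub accounts.

```python
import json
import urllib.request
from typing import Callable, Dict, Iterable, List, Optional

# Assumed base URL for the TBA API v3.
TBA_BASE = "https://www.thebluealliance.com/api/v3"


def tba_get(path: str, auth_key: str):
    """Hypothetical helper: GET a TBA endpoint and decode the JSON body."""
    request = urllib.request.Request(
        TBA_BASE + path, headers={"X-TBA-Auth-Key": auth_key}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def collect_team_keys(get_page: Callable[[int], List[str]]) -> List[str]:
    """Request pages 0, 1, 2, ... and stop at the first empty page."""
    keys: List[str] = []
    page = 0
    while True:
        batch = get_page(page)
        if not batch:
            break
        keys.extend(batch)
        page += 1
    return keys


def github_username(profiles: Iterable[dict]) -> Optional[str]:
    """Return the GitHub username from one team's social media entries.

    Assumes each entry looks like {"type": ..., "foreign_key": ...}.
    """
    for profile in profiles:
        if profile.get("type") == "github-profile":
            return profile.get("foreign_key")
    return None


def find_github_accounts(
    team_keys: Iterable[str],
    get_social_media: Callable[[str], List[dict]],
) -> Dict[str, str]:
    """Map team key -> GitHub username, skipping teams without one."""
    accounts: Dict[str, str] = {}
    for key in team_keys:
        username = github_username(get_social_media(key))
        if username:
            accounts[key] = username
    return accounts
```

With a real API key in a variable `key`, the pieces would be wired together roughly as `collect_team_keys(lambda page: tba_get(f"/teams/{page}/keys", key))` followed by `find_github_accounts(keys, lambda team: tba_get(f"/team/{team}/social_media", key))`. Passing the fetch functions in as arguments keeps the paging and filtering logic easy to try out with fake data, without touching the network.
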
To check for language usage, a simple Regex pattern can be used: `/programmingLanguage"\>(.*)\