frc-stats

Evan Pratten 2019-07-06 19:19:02 -04:00
parent c9b53149be
commit 43cbaf7cb4
5 changed files with 461 additions and 56 deletions

View File

@ -0,0 +1,102 @@
---
layout: post
title: "Scraping FRC team's GitHub accounts to gather large amounts of data"
description: "There are a lot of teams..."
date: 2019-07-06 15:08:00
categories: frc
---
I was curious about the most used languages for FRC, so I built a Python script to find out what they were.
## Some basic data
Before we get to the heavy work done by my script, let's start with some general data.
Thanks to the [TBA API](https://www.thebluealliance.com/apidocs/v3), I know that there are 6917 registered teams. 492 of them have registered at least one account on GitHub.
## How the script works
The script is split into steps:
- Get a list of every registered team
- Check for a GitHub account attached to every registered team
  - If a team has an account, it is added to the dataset
- Load each GitHub profile
  - If it is a private account, move on
  - Use Regex to find all languages used
- Compile data and sort
### Getting a list of accounts
This is probably the simplest step in the whole process. I used the auto-generated [tbaapiv3client](https://github.com/TBA-API/tba-api-client-python) Python library's `get_teams_keys(key)` function, and kept incrementing `key` until I got an empty array. All returned data was then added together into a big list of team keys.
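
The paging loop described above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: `fetch_page` is a hypothetical stand-in for whatever call returns one page of team keys (for example, a thin wrapper around the client library's `get_teams_keys`), and it is assumed to return an empty list once the pages run out.

```python
from typing import Callable, List


def collect_all_team_keys(fetch_page: Callable[[int], List[str]]) -> List[str]:
    """Request page 0, 1, 2, ... until an empty page comes back.

    `fetch_page` is a hypothetical callable: given a page number, it
    returns a list of team keys such as ["frc1", "frc4", ...].
    """
    all_keys: List[str] = []
    page = 0
    while True:
        batch = fetch_page(page)
        if not batch:
            # An empty array means there are no more teams to list
            break
        all_keys.extend(batch)
        page += 1
    return all_keys
```

Passing the page fetcher in as an argument keeps the loop easy to try out with fake data before pointing it at the real API.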
### Checking for a team's GitHub account
The [TBA API](https://www.thebluealliance.com/apidocs/v3) helpfully provides a `/api/v3/team/<number>/social_media` API endpoint that will give the GitHub username for any team you request (or nothing if they don't use GitHub).
A `for` loop on this with a list of every team number did the trick for finding accounts.
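
As a rough sketch of that lookup, the snippet below requests the social media entries for one team and picks out the GitHub username. It rests on a few assumptions that are not stated above: that each entry is a JSON object with a `type` and a `foreign_key` field, that GitHub accounts use the type `github-profile`, and that the API key is sent in an `X-TBA-Auth-Key` header. The function names are made up for illustration.

```python
import json
import urllib.request
from typing import Dict, List, Optional

TBA_BASE = "https://www.thebluealliance.com/api/v3"


def fetch_social_media(team_key: str, auth_key: str) -> List[Dict]:
    """Download the social media entries for one team (e.g. "frc5024").

    Assumption: the API key is accepted via the X-TBA-Auth-Key header.
    """
    request = urllib.request.Request(
        f"{TBA_BASE}/team/{team_key}/social_media",
        headers={"X-TBA-Auth-Key": auth_key},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def extract_github_username(entries: List[Dict]) -> Optional[str]:
    """Return the first GitHub username in the entries, or None.

    Assumption: GitHub accounts appear with type "github-profile" and
    the username is stored under "foreign_key".
    """
    for entry in entries:
        if entry.get("type") == "github-profile":
            return entry.get("foreign_key")
    return None
```

Keeping the parsing separate from the network call means the username extraction can be checked against saved responses without spending API requests.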
### Fetching language info
To remove the need for an OAuth login to use the script, GitHub data is retrieved using standard HTTPS requests instead of AJAX requests to the API. This gets around the tiny rate limit, but takes a bit longer to complete.
To check for language usage, a simple Regex pattern can be used: `/programmingLanguage"\>(.*)\</gm`
When combined with an `re.findall()`, this pattern will return a list of all recent languages used by a team.
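
Here is a small sketch of that pattern in Python form (the `/.../gm` wrapper is JavaScript-style notation, so it is dropped here). The sample HTML is a made-up fragment that assumes each language sits in its own `itemprop="programmingLanguage"` tag on its own line; the real page markup may differ.

```python
import re
from typing import List

# Same pattern as above, without the JavaScript-style delimiters and flags
LANGUAGE_PATTERN = re.compile(r'programmingLanguage">(.*)<')


def find_languages(profile_html: str) -> List[str]:
    """Return every language name captured by the pattern, in page order."""
    return LANGUAGE_PATTERN.findall(profile_html)


# Hypothetical fragment of a profile page, one language tag per line
SAMPLE_HTML = """
<span itemprop="programmingLanguage">Java</span>
<span itemprop="programmingLanguage">Python</span>
<span itemprop="programmingLanguage">Java</span>
"""

print(find_languages(SAMPLE_HTML))  # ['Java', 'Python', 'Java']
```

Because `.*` is greedy, the capture runs to the last `<` on the line. That is fine when each tag has a line to itself; if several tags could share a line, the non-greedy `(.*?)` would be a safer choice.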
### Data saves / backup solution
To deal with the fact that large amounts of data are being requested, and people might want to pause the script, I have created a system to allow for "savestates".
On launch of the script, it will check for a `./data.json` file. If this does not exist, one will be created. Otherwise, the contents will be read. This file contains both all the saved data, and some counters.
Each stage of the script contains a counter, and will increment the counter every time a team has been processed. This way, if the script is stopped and restarted, the parsers will just keep working from where they left off. This was very helpful when writing the script, as I needed to stop and start it every time I needed to implement a new feature.
All parsing data is saved to the JSON file every time the script completes, or when it detects a termination signal such as `SIGINT` (strictly speaking, a `SIGKILL` cannot be caught by a process).
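
A savestate system along these lines could look roughly like the sketch below. It is not the script's real code: the layout of the JSON file (a `counters` dict and a `data` dict), the function names, and the choice of `SIGINT` as the signal to react to are all assumptions made for illustration. `SIGINT` is used because a true `SIGKILL` cannot be intercepted by a process.

```python
import json
import os
import signal
import sys
from typing import Any, Callable, Dict, List


def load_state(path: str) -> Dict[str, Any]:
    """Read the save file, creating an empty one if it does not exist."""
    if not os.path.exists(path):
        state = {"counters": {}, "data": {}}  # assumed file layout
        save_state(path, state)
        return state
    with open(path) as f:
        return json.load(f)


def save_state(path: str, state: Dict[str, Any]) -> None:
    """Write to a temporary file first so a crash cannot corrupt the save."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def run_stage(stage: str, items: List[Any], state: Dict[str, Any],
              handle: Callable[[Any], None]) -> None:
    """Process items, skipping the ones counted in an earlier run."""
    already_done = state["counters"].get(stage, 0)
    for item in items[already_done:]:
        handle(item)
        # Bump the counter only after the item was fully processed
        state["counters"][stage] = state["counters"].get(stage, 0) + 1


def save_on_interrupt(path: str, state: Dict[str, Any]) -> None:
    """Save and exit cleanly when Ctrl+C (SIGINT) is received."""
    def handler(signum, frame):
        save_state(path, state)
        sys.exit(0)
    signal.signal(signal.SIGINT, handler)
```

A run would call `load_state("./data.json")` on launch, register `save_on_interrupt`, push each stage through `run_stage`, and call `save_state` once more at the end.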
## What I learned
After letting the script run for about an hour, I got a bunch of data from every registered team.
This data includes every project (both on and offseason) from each team, so teams that build t-shirt cannons using the CTRE HERO would have C# in their list of languages. Things like that.
Unsurprisingly, by far the most popular programming language is Java, with 3232 projects. These projects were all mostly or entirely written in Java. Next up, we have C++ with 725 projects, and Python with 468 projects.
After Java, C++, and Python, we start running into languages used for dashboards, design, lessons, and offseason projects. Before I get to everything else, here is the usage of the rest of the valid languages for FRC robots:
- C (128)
- LabVIEW (153)
- Kotlin (96)
- Rust (4)
Now, the rest of the languages below Python:
```
295 occurrences of JavaScript
153 occurrences of LabVIEW
128 occurrences of C
96 occurrences of Kotlin
72 occurrences of Arduino
71 occurrences of C#
69 occurrences of CSS
54 occurrences of PHP
40 occurrences of Shell
34 occurrences of Ruby
16 occurrences of Swift
16 occurrences of Jupyter Notebook
15 occurrences of Scala
12 occurrences of D
12 occurrences of TypeScript
9 occurrences of Dart
8 occurrences of Processing
7 occurrences of CoffeeScript
6 occurrences of Go
6 occurrences of Groovy
6 occurrences of Objective-C
4 occurrences of Rust
3 occurrences of MATLAB
3 occurrences of R
1 occurrences of Visual Basic
1 occurrences of Clojure
1 occurrences of Cuda
```
I have removed markup and shell languages from that list because most of them are probably auto-generated.
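
For anyone curious how a tally in that format might be produced, here is a minimal sketch using `collections.Counter`. The function name and the `excluded` parameter are hypothetical, and `"HTML"` in the usage line is only an example of a language one might filter out.

```python
from collections import Counter
from typing import Iterable, List


def summarize_languages(languages: Iterable[str],
                        excluded: Iterable[str] = ()) -> List[str]:
    """Count each language and format the counts, most common first."""
    skip = set(excluded)
    counts = Counter(lang for lang in languages if lang not in skip)
    return [
        f"{count} occurrences of {name}"
        for name, count in counts.most_common()
    ]


# Hypothetical input: one entry per project
for line in summarize_languages(["Java", "C", "Java", "HTML"], excluded=["HTML"]):
    print(line)
```

`most_common()` already returns the entries sorted by count, so no separate sorting step is needed.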
In terms of GitHub account names, 133 teams follow FRC convention and use a username starting with `frc` followed by their team number, 95 teams use `team` then their number, and 264 teams use something else.
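
A naming breakdown like this can be computed with a couple of regular expressions. The sketch below is one possible way to do it, not necessarily how the script does: the category labels are made up, and matching case-insensitively is an assumption.

```python
import re
from collections import Counter
from typing import Iterable


def classify_username(username: str) -> str:
    """Sort a GitHub username into "frc", "team", or "other"."""
    name = username.lower()  # assumption: casing does not matter
    if re.match(r"frc\d+", name):
        return "frc"
    if re.match(r"team\d+", name):
        return "team"
    return "other"


def count_conventions(usernames: Iterable[str]) -> Counter:
    """Tally how many usernames fall into each category."""
    return Counter(classify_username(name) for name in usernames)
```

For example, `classify_username("frc5024")` returns `"frc"`, while a personal-style name such as `"Ewpratten"` falls into `"other"`.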
## Using the script
This script is not on PyPI this time. You can obtain a copy from my GitHub repo: [https://github.com/Ewpratten/frc-code-stats](https://github.com/Ewpratten/frc-code-stats)
First, make sure both `python3.7` and `python3-pip` are installed on your computer. Next, delete the `data.json` file. Then, install the requirements with `pip3 install -r requirements.txt`. Finally, run with `python3 main.py` to start the script. Now, go outside and enjoy nature for about an hour, and your data should be loaded!

View File

@ -92,6 +92,13 @@
<ul>
<!-- <header class="major"> -->
<li><h3><a href="/frc/2019/07/06/ScrapingFRCGithub.html" class="link" title="2019-07-06 11:08:00 -0400">Scraping FRC team's GitHub accounts to gather large amounts of data</a></h3></li>
<!-- </header> -->
<!-- <header class="major"> -->
<li><h3><a href="/projects/2019/07/01/devDNS.html" class="link" title="2019-07-01 18:13:00 -0400">devDNS</a></h3></li>
<!-- </header> -->

View File

@ -1,4 +1,105 @@
<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.8.5">Jekyll</generator><link href="http://0.0.0.0:4000/feed.xml" rel="self" type="application/atom+xml" /><link href="http://0.0.0.0:4000/" rel="alternate" type="text/html" /><updated>2019-07-01T22:38:02-04:00</updated><id>http://0.0.0.0:4000/feed.xml</id><title type="html">Evan Pratten</title><subtitle>Computer wizard, student, &lt;a href=&quot;https://github.com/frc5024&quot;&gt;@frc5024&lt;/a&gt; programming team lead, and radio enthusiast.</subtitle><entry><title type="html">devDNS</title><link href="http://0.0.0.0:4000/projects/2019/07/01/devDNS.html" rel="alternate" type="text/html" title="devDNS" /><published>2019-07-01T18:13:00-04:00</published><updated>2019-07-01T18:13:00-04:00</updated><id>http://0.0.0.0:4000/projects/2019/07/01/devDNS</id><content type="html" xml:base="http://0.0.0.0:4000/projects/2019/07/01/devDNS.html">&lt;p&gt;Over the past year and a half, I have been hacking my way around the undocumented &lt;a href=&quot;https://devrant.com&quot;&gt;devRant&lt;/a&gt; auth/write API. At the request of devRants creators, this API must not be documented due to the way logins work on the platform. That is besides the point. I have been working on a little project called &lt;a href=&quot;https://devrant.com/collabs/2163502&quot;&gt;devDNS&lt;/a&gt; over the past few days that uses this undocumented API. Why must I be so bad at writing intros?&lt;/p&gt;
<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.8.5">Jekyll</generator><link href="http://0.0.0.0:4000/feed.xml" rel="self" type="application/atom+xml" /><link href="http://0.0.0.0:4000/" rel="alternate" type="text/html" /><updated>2019-07-06T19:18:50-04:00</updated><id>http://0.0.0.0:4000/feed.xml</id><title type="html">Evan Pratten</title><subtitle>Computer wizard, student, &lt;a href=&quot;https://github.com/frc5024&quot;&gt;@frc5024&lt;/a&gt; programming team lead, and radio enthusiast.</subtitle><entry><title type="html">Scraping FRC teams GitHub accounts to gather large amounts of data</title><link href="http://0.0.0.0:4000/frc/2019/07/06/ScrapingFRCGithub.html" rel="alternate" type="text/html" title="Scraping FRC team's GitHub accounts to gather large amounts of data" /><published>2019-07-06T11:08:00-04:00</published><updated>2019-07-06T11:08:00-04:00</updated><id>http://0.0.0.0:4000/frc/2019/07/06/ScrapingFRCGithub</id><content type="html" xml:base="http://0.0.0.0:4000/frc/2019/07/06/ScrapingFRCGithub.html">&lt;p&gt;I was curious about the most used languages for FRC, so I build a Python script to find out what they where.&lt;/p&gt;
&lt;h2 id=&quot;some-basic-data&quot;&gt;Some basic data&lt;/h2&gt;
&lt;p&gt;Before we get to the heavy work done by my script, let's start with some general data.&lt;/p&gt;
&lt;p&gt;Thanks to the &lt;a href=&quot;https://www.thebluealliance.com/apidocs/v3&quot;&gt;TBA API&lt;/a&gt;, I know that there are 6917 registered teams. 492 of them have registered at least one account on GitHub.&lt;/p&gt;
&lt;h2 id=&quot;how-the-script-works&quot;&gt;How the script works&lt;/h2&gt;
&lt;p&gt;The script is split into steps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Get a list of every registered team&lt;/li&gt;
&lt;li&gt;Check for a github account attached to every registered team
&lt;ul&gt;
&lt;li&gt;If a team has an account, it is added to the dataset&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Load each github profile
&lt;ul&gt;
&lt;li&gt;If it is a private account, move on&lt;/li&gt;
&lt;li&gt;Use Regex to find all languages used&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Compile data and sort&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;getting-a-list-of-accounts&quot;&gt;Getting a list of accounts&lt;/h3&gt;
&lt;p&gt;This is probably the simplest step in the whole process. I used the auto-generated &lt;a href=&quot;https://github.com/TBA-API/tba-api-client-python&quot;&gt;tbaapiv3client&lt;/a&gt; Python library's &lt;code class=&quot;highlighter-rouge&quot;&gt;get_teams_keys(key)&lt;/code&gt; function, and kept incrementing &lt;code class=&quot;highlighter-rouge&quot;&gt;key&lt;/code&gt; until I got an empty array. All returned data was then added together into a big list of team keys.&lt;/p&gt;
&lt;h3 id=&quot;checking-for-a-teams-github-account&quot;&gt;Checking for a team's GitHub account&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://www.thebluealliance.com/apidocs/v3&quot;&gt;TBA API&lt;/a&gt; helpfully provides a &lt;code class=&quot;highlighter-rouge&quot;&gt;/api/v3/team/&amp;lt;number&amp;gt;/social_media&lt;/code&gt; API endpoint that will give the GitHub username for any team you request (or nothing if they don't use GitHub).&lt;/p&gt;
&lt;p&gt;A &lt;code class=&quot;highlighter-rouge&quot;&gt;for&lt;/code&gt; loop on this with a list of every team number did the trick for finding accounts.&lt;/p&gt;
&lt;h3 id=&quot;fetching-language-info&quot;&gt;Fetching language info&lt;/h3&gt;
&lt;p&gt;To remove the need for an Oauth login to use the script, GitHub data is retrieved using standard HTTPS requests instead of AJAX requests to the API. This gets around the tiny rate limit, but takes a bit longer to complete.&lt;/p&gt;
&lt;p&gt;To check for language usage, a simple Regex pattern can be used: &lt;code class=&quot;highlighter-rouge&quot;&gt;/programmingLanguage&quot;\&amp;gt;(.*)\&amp;lt;/gm&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;When combined with an &lt;code class=&quot;highlighter-rouge&quot;&gt;re.findall()&lt;/code&gt;, this pattern will return a list of all recent languages used by a team.&lt;/p&gt;
&lt;h3 id=&quot;data-saves--backup-solution&quot;&gt;Data saves / backup solution&lt;/h3&gt;
&lt;p&gt;To deal with the fact that large amounts of data are being requested, and people might want to pause the script, I have created a system to allow for “savestates”.&lt;/p&gt;
&lt;p&gt;On launch of the script, it will check for a &lt;code class=&quot;highlighter-rouge&quot;&gt;./data.json&lt;/code&gt; file. If this does not exist, one will be created. Otherwise, the contents will be read. This file contains both all the saved data, and some counters.&lt;/p&gt;
&lt;p&gt;Each stage of the script contains a counter, and will increment the counter every time a team has been processed. This way, if the script is stopped and restarted, the parsers will just keep working from where they left off. This was very helpful when writing the script, as I needed to stop and start it every time I needed to implement a new feature.&lt;/p&gt;
&lt;p&gt;All parsing data is saved to the json file every time the script completes, or it detects a &lt;code class=&quot;highlighter-rouge&quot;&gt;SIGKILL&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id=&quot;what-i-learned&quot;&gt;What I learned&lt;/h2&gt;
&lt;p&gt;After letting the script run for about an hour, I got a bunch of data from every registered team.&lt;/p&gt;
&lt;p&gt;This data includes every project (both on and offseason) from each team, so teams that build t-shirt cannons using the CTRE HERO would have C# in their list of languages. Things like that.&lt;/p&gt;
&lt;p&gt;Unsurprisingly, by far the most popular programming language is Java, with 3232 projects. These projects were all mostly or entirely written in Java. Next up, we have C++ with 725 projects, and Python with 468 projects.&lt;/p&gt;
&lt;p&gt;After Java, C++, and Python, we start running into languages used for dashboards, design, lessons, and offseason projects. Before I get to everything else, here is the usage of the rest of the valid languages for FRC robots:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;C (128)&lt;/li&gt;
&lt;li&gt;LabVIEW (153)&lt;/li&gt;
&lt;li&gt;Kotlin (96)&lt;/li&gt;
&lt;li&gt;Rust (4)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Now, the rest of the languages below Python:&lt;/p&gt;
&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;295 occurrences of JavaScript
153 occurrences of LabVIEW
128 occurrences of C
96 occurrences of Kotlin
72 occurrences of Arduino
71 occurrences of C#
69 occurrences of CSS
54 occurrences of PHP
40 occurrences of Shell
34 occurrences of Ruby
16 occurrences of Swift
16 occurrences of Jupyter Notebook
15 occurrences of Scala
12 occurrences of D
12 occurrences of TypeScript
9 occurrences of Dart
8 occurrences of Processing
7 occurrences of CoffeeScript
6 occurrences of Go
6 occurrences of Groovy
6 occurrences of Objective-C
4 occurrences of Rust
3 occurrences of MATLAB
3 occurrences of R
1 occurrences of Visual Basic
1 occurrences of Clojure
1 occurrences of Cuda
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;I have removed markup and shell languages from that list because most of them are probably auto-generated.&lt;/p&gt;
&lt;p&gt;In terms of github account names, 133 teams follow FRC convention and use a username starting with &lt;code class=&quot;highlighter-rouge&quot;&gt;frc&lt;/code&gt;, followed by their team number, 95 teams use &lt;code class=&quot;highlighter-rouge&quot;&gt;team&lt;/code&gt; then their number, and 264 teams use something else.&lt;/p&gt;
&lt;h2 id=&quot;using-the-script&quot;&gt;Using the script&lt;/h2&gt;
&lt;p&gt;This script is not on PYPI this time. You can obtain a copy from my GitHub repo: &lt;a href=&quot;https://github.com/Ewpratten/frc-code-stats&quot;&gt;https://github.com/Ewpratten/frc-code-stats&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;First, make sure both &lt;code class=&quot;highlighter-rouge&quot;&gt;python3.7&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;python3-pip&lt;/code&gt; are installed on your computer. Next, delete the &lt;code class=&quot;highlighter-rouge&quot;&gt;data.json&lt;/code&gt; file. Then, install the requirements with &lt;code class=&quot;highlighter-rouge&quot;&gt;pip3 install -r requirements.txt&lt;/code&gt;. Finally, run with &lt;code class=&quot;highlighter-rouge&quot;&gt;python3 main.py&lt;/code&gt; to start the script. Now, go outside and enjoy nature for about an hour, and your data should be loaded!.&lt;/p&gt;</content><author><name></name></author><summary type="html">I was curious about the most used languages for FRC, so I build a Python script to find out what they where.</summary></entry><entry><title type="html">devDNS</title><link href="http://0.0.0.0:4000/projects/2019/07/01/devDNS.html" rel="alternate" type="text/html" title="devDNS" /><published>2019-07-01T18:13:00-04:00</published><updated>2019-07-01T18:13:00-04:00</updated><id>http://0.0.0.0:4000/projects/2019/07/01/devDNS</id><content type="html" xml:base="http://0.0.0.0:4000/projects/2019/07/01/devDNS.html">&lt;p&gt;Over the past year and a half, I have been hacking my way around the undocumented &lt;a href=&quot;https://devrant.com&quot;&gt;devRant&lt;/a&gt; auth/write API. At the request of devRants creators, this API must not be documented due to the way logins work on the platform. That is besides the point. I have been working on a little project called &lt;a href=&quot;https://devrant.com/collabs/2163502&quot;&gt;devDNS&lt;/a&gt; over the past few days that uses this undocumented API. Why must I be so bad at writing intros?&lt;/p&gt;
&lt;h2 id=&quot;what-is-devdns&quot;&gt;What is devDNS&lt;/h2&gt;
&lt;p&gt;devDNS is a devRant bot written in python. It will serve any valid DNS query from any user on the platform. A query is just a comment in one of the following forms:&lt;/p&gt;
@ -377,56 +478,4 @@ https://retrylife.ca/feed.xml
&lt;audio controls=&quot;&quot;&gt;
&lt;source src=&quot;/assets/audio/SpamPhoneCalls.mp3&quot; type=&quot;audio/mpeg&quot; /&gt;
Your browser does not support audio players
&lt;/audio&gt;</content><author><name></name></author><summary type="html">I am currently taking a class in school called Music and computers (AMM2M), where as part of the class, whe get together into bands, and produce a song. After taking a break from music production for over a year, we have released our song for the class (we do two songs, but the second is not finished yet).</summary></entry><entry><title type="html">Graphing the relation between wheels and awards for FRC</title><link href="http://0.0.0.0:4000/frc/2019/06/16/Graphing-w2a.html" rel="alternate" type="text/html" title="Graphing the relation between wheels and awards for FRC" /><published>2019-06-16T11:51:00-04:00</published><updated>2019-06-16T11:51:00-04:00</updated><id>http://0.0.0.0:4000/frc/2019/06/16/Graphing-w2a</id><content type="html" xml:base="http://0.0.0.0:4000/frc/2019/06/16/Graphing-w2a.html">&lt;p&gt;I was scrolling through reddit the other day, and came across &lt;a href=&quot;https://www.reddit.com/r/FRC/comments/byzv5q/i_know_what_im_doing/&quot;&gt;this great post&lt;/a&gt; by u/&lt;a href=&quot;https://www.reddit.com/user/MasterQuacks/&quot;&gt;MasterQuacks&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/w2ainspo.jpg&quot; alt=&quot;My insporation&quot; /&gt;&lt;/p&gt;
&lt;p&gt;I thought to myself “ha. Thats funny”, and moved on. But that thought had stuck with me.&lt;/p&gt;
&lt;p&gt;So here I am, bored on a sunday afternoon, staring at the matplotlib documentation.&lt;/p&gt;
&lt;h2 id=&quot;my-creation&quot;&gt;My creation&lt;/h2&gt;
&lt;p&gt;In only a few lines of python, I have a program that will (badly) graph the number of awards per wheel for any team, or set of teams.&lt;/p&gt;
&lt;p&gt;As always, feel free to tinker with the code. This one is not published anywhere, so if you want to share it, I would appreciate a mention.&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;requests&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;plt&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Team&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;__init__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;wheels&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;id&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;id&lt;/span&gt;
&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;wheels&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;wheels&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;### CONFIG ###
&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;teams&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Team&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5024&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Team&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;254&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Team&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1114&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Team&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5406&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Team&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2056&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;year&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2019&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;##############
&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;team&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;enumerate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;teams&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;award_data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;requests&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;https://www.thebluealliance.com/api/v3/team/frc&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;team&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;/awards/&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;year&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;params&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;X-TBA-Auth-Key&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;mz0VWTNtXTDV8NNOz3dYg9fHOZw8UYek270gynLQ4v9veaaUJEPvJFCZRmte7AUN&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;})&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;json&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;awards_count&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;award_data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;team&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w2a&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;awards_count&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;team&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;wheels&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;team&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;team&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w2a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;plt&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bar&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;team&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w2a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tick_label&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;team&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# Plot
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x_lables&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;team&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;id&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;team&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;teams&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# plt.set_xticklabels(x_lables)
&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;with&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;plt&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xkcd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;():&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;plt&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;title&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'Awards per wheel'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;plt&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;show&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;h2 id=&quot;the-result&quot;&gt;The result&lt;/h2&gt;
&lt;p&gt;Here is the resulting image. From left, to right: 5024, 254, 2224, 5406, 2056&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/w2a.png&quot; alt=&quot;Thr result&quot; /&gt;&lt;/p&gt;</content><author><name></name></author><summary type="html">I was scrolling through reddit the other day, and came across this great post by u/MasterQuacks.</summary></entry></feed>
&lt;/audio&gt;</content><author><name></name></author><summary type="html">I am currently taking a class in school called Music and computers (AMM2M), where as part of the class, whe get together into bands, and produce a song. After taking a break from music production for over a year, we have released our song for the class (we do two songs, but the second is not finished yet).</summary></entry></feed>

View File

@ -0,0 +1,247 @@
<!DOCTYPE html>
<html>
<head>
<title>Evan Pratten</title>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1, user-scalable=no" />
<!--[if lte IE 8]><script src="/assets/js/ie/html5shiv.js"></script><![endif]-->
<link rel="stylesheet" href="/assets/css/main.css" />
<!-- <link rel="stylesheet" href="/assets/css/custom.css" /> -->
<!--[if lte IE 9]><link rel="stylesheet" href="/assets/css/ie9.css" /><![endif]-->
<!--[if lte IE 8]><link rel="stylesheet" href="/assets/css/ie8.css" /><![endif]-->
<!-- Syntax highlight -->
<link rel="stylesheet" href="/assets/css/vs.css" />
</head>
<body>
<!-- Wrapper -->
<div id="wrapper">
<!-- Header -->
<header id="header" >
<a href="http://0.0.0.0:4000//" class="logo"><strong>Evan Pratten</strong> <span>retrylife</span></a>
<nav>
<!-- <a href="#menu">Menu</a> -->
</nav>
</header>
<!-- Menu -->
<!-- <nav id="menu">
<ul class="links">
<li><a href="http://0.0.0.0:4000//">Home</a></li>
<li><a href="http://0.0.0.0:4000/all_posts.html">All posts</a></li>
</ul>
<ul class="actions vertical">
<li><a href="#" class="button special fit">Get Started</a></li>
<li><a href="#" class="button fit">Log In</a></li>
</ul>
</nav> -->
<section id="banner" class="major" style="height:40vh">
<div class="inner">
<header class="major">
<h1>Scraping FRC team's GitHub accounts to gather large amounts of data</h1>
</header>
<div class="content">
<p >There are a lot of teams...</p>
</div>
</div>
</section>
<!-- Main -->
<div id="main" class="alt">
<!-- One -->
<section id="one">
<div class="inner">
<p><p>I was curious about the most used languages for FRC, so I built a Python script to find out what they were.</p>
<h2 id="some-basic-data">Some basic data</h2>
<p>Before we get to the heavy work done by my script, let's start with some general data.</p>
<p>Thanks to the <a href="https://www.thebluealliance.com/apidocs/v3">TBA API</a>, I know that there are 6917 registered teams. 492 of them have registered at least one account on GitHub.</p>
<h2 id="how-the-script-works">How the script works</h2>
<p>The script is split into steps:</p>
<ul>
<li>Get a list of every registered team</li>
<li>Check for a github account attached to every registered team
<ul>
<li>If a team has an account, it is added to the dataset</li>
</ul>
</li>
<li>Load each github profile
<ul>
<li>If it is a private account, move on</li>
<li>Use Regex to find all languages used</li>
</ul>
</li>
<li>Compile data and sort</li>
</ul>
<h3 id="getting-a-list-of-accounts">Getting a list of accounts</h3>
<p>This is probably the simplest step in the whole process. I used the auto-generated <a href="https://github.com/TBA-API/tba-api-client-python">tbaapiv3client</a> Python library's <code class="highlighter-rouge">get_teams_keys(key)</code> function, and kept incrementing <code class="highlighter-rouge">key</code> until I got an empty array. All returned data was then combined into one big list of team keys.</p>
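<p>The paging loop described above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: <code class="highlighter-rouge">collect_all_pages</code>, <code class="highlighter-rouge">fake_fetch</code>, and the sample keys are hypothetical names, and a stand-in function replaces the real API call so the example runs offline.</p>

```python
from typing import Callable, List


def collect_all_pages(fetch_page: Callable[[int], List[str]]) -> List[str]:
    """Call fetch_page(0), fetch_page(1), ... until a page comes back empty."""
    all_keys: List[str] = []
    page = 0
    while True:
        keys = fetch_page(page)
        if not keys:  # an empty page means we ran past the last team
            break
        all_keys.extend(keys)
        page += 1
    return all_keys


# Stand-in for the real API call so the sketch runs offline (hypothetical data).
FAKE_PAGES = [["frc1", "frc4"], ["frc5024"], []]


def fake_fetch(page: int) -> List[str]:
    return FAKE_PAGES[page] if page < len(FAKE_PAGES) else []


team_keys = collect_all_pages(fake_fetch)
print(team_keys)
```

<p>In a real run, the function passed in would wrap the library call and return one page of team keys per call.</p>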
<h3 id="checking-for-a-teams-github-account">Checking for a team's GitHub account</h3>
<p>The <a href="https://www.thebluealliance.com/apidocs/v3">TBA API</a> helpfully provides a <code class="highlighter-rouge">/api/v3/team/&lt;number&gt;/social_media</code> API endpoint that returns the GitHub username for any team you request (or nothing if the team doesn't use GitHub).</p>
<p>A <code class="highlighter-rouge">for</code> loop on this with a list of every team number did the trick for finding accounts.</p>
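<p>As a rough sketch of that loop, the snippet below filters sample responses for a GitHub entry. It assumes each response is a list of entries whose <code class="highlighter-rouge">type</code> is <code class="highlighter-rouge">github-profile</code> for GitHub accounts, with the username stored under <code class="highlighter-rouge">foreign_key</code>; the helper name and the sample data are hypothetical.</p>

```python
from typing import Dict, List, Optional


def find_github_username(social_media: List[Dict[str, str]]) -> Optional[str]:
    """Return the first GitHub username in a social_media response, or None."""
    for entry in social_media:
        # Assumption: GitHub entries use type "github-profile" and keep the
        # username in "foreign_key".
        if entry.get("type") == "github-profile":
            return entry.get("foreign_key")
    return None


# Hypothetical responses, keyed by team number, standing in for real API calls.
sample_responses = {
    1: [{"type": "twitter-profile", "foreign_key": "exampleteam"}],
    2: [{"type": "github-profile", "foreign_key": "frc2"}],
    3: [],
}

accounts = {}
for team_number, response in sample_responses.items():
    username = find_github_username(response)
    if username is not None:
        accounts[team_number] = username

print(accounts)
```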
<h3 id="fetching-language-info">Fetching language info</h3>
<p>To remove the need for an OAuth login to use the script, GitHub data is retrieved using standard HTTPS requests instead of AJAX requests to the API. This gets around the tiny rate limit, but takes a bit longer to complete.</p>
<p>To check for language usage, a simple Regex pattern can be used: <code class="highlighter-rouge">/programmingLanguage"\&gt;(.*)\&lt;/gm</code></p>
<p>When combined with an <code class="highlighter-rouge">re.findall()</code>, this pattern will return a list of all recent languages used by a team.</p>
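<p>Here is a small sketch of how that pattern might be applied and tallied in Python. The sample markup is a hypothetical excerpt, not a real GitHub page, and it assumes one language per line: because <code class="highlighter-rouge">.*</code> is greedy, two matches on the same line would be merged into one.</p>

```python
import re
from collections import Counter

# The pattern from the post, written as a Python raw string.
LANGUAGE_PATTERN = re.compile(r'programmingLanguage">(.*)<')

# Hypothetical excerpt of a profile page, one repository language per line.
sample_page = """
<span itemprop="programmingLanguage">Java</span>
<span itemprop="programmingLanguage">Python</span>
<span itemprop="programmingLanguage">Java</span>
"""

languages = LANGUAGE_PATTERN.findall(sample_page)
counts = Counter(languages)

# Print the tallies, most common language first.
for language, count in counts.most_common():
    print(f"{count} occurrences of {language}")
```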
<h3 id="data-saves--backup-solution">Data saves / backup solution</h3>
<p>To deal with the fact that large amounts of data are being requested, and people might want to pause the script, I have created a system to allow for “savestates”.</p>
<p>On launch of the script, it will check for a <code class="highlighter-rouge">./data.json</code> file. If this does not exist, one will be created. Otherwise, the contents will be read. This file contains both all the saved data, and some counters.</p>
<p>Each stage of the script contains a counter, and will increment the counter every time a team has been processed. This way, if the script is stopped and restarted, the parsers will just keep working from where they left off. This was very helpful when writing the script, as I needed to stop and start it every time I implemented a new feature.</p>
<p>All parsing data is saved to the JSON file every time the script completes, or when it detects a catchable termination signal (a true <code class="highlighter-rouge">SIGKILL</code> cannot be intercepted by a process).</p>
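<p>The savestate idea can be sketched like this. It is a simplified illustration under stated assumptions, not the script's real code: the file layout (a single <code class="highlighter-rouge">counter</code> plus a <code class="highlighter-rouge">results</code> list), the function names, and the <code class="highlighter-rouge">stop_after</code> parameter used to simulate a pause are all hypothetical, and a <code class="highlighter-rouge">try</code>/<code class="highlighter-rouge">finally</code> block stands in for explicit signal handling.</p>

```python
import json
import os
import tempfile


def load_state(path):
    """Read the saved state, or start fresh if the file does not exist."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"counter": 0, "results": []}


def save_state(path, state):
    """Write the current state to disk as JSON."""
    with open(path, "w") as f:
        json.dump(state, f)


def process_teams(teams, path, handle_team, stop_after=None):
    """Process teams from the saved counter onward, saving state on exit."""
    state = load_state(path)
    try:
        for team in teams[state["counter"]:]:
            if stop_after is not None and state["counter"] >= stop_after:
                break  # simulates the script being paused part-way
            state["results"].append(handle_team(team))
            state["counter"] += 1
    finally:
        # Runs on normal completion and on exceptions such as KeyboardInterrupt.
        save_state(path, state)
    return state


# Demo with a throwaway file: stop after two teams, then resume.
demo_path = os.path.join(tempfile.mkdtemp(), "data.json")
teams = ["frc1", "frc2", "frc3", "frc4"]
first_run = process_teams(teams, demo_path, str.upper, stop_after=2)
second_run = process_teams(teams, demo_path, str.upper)
print(first_run["counter"], second_run["results"])
```

<p>The second call picks up at the third team because the counter was written to disk when the first call exited.</p>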
<h2 id="what-i-learned">What I learned</h2>
<p>After letting the script run for about an hour, I got a bunch of data from every registered team.</p>
<p>This data includes every project (both on and offseason) from each team, so teams that build t-shirt cannons using the CTRE HERO would have C# in their list of languages. Things like that.</p>
<p>Unsurprisingly, by far the most popular programming language is Java, with 3232 projects. These projects were all mostly or entirely written in Java. Next up, we have C++ with 725 projects, and Python with 468 projects.</p>
<p>After Java, C++, and Python, we start running into languages used for dashboards, design, lessons, and offseason projects. Before I get to everything else, here is the usage of the rest of the valid languages for FRC robots:</p>
<ul>
<li>C (128)</li>
<li>LabVIEW (153)</li>
<li>Kotlin (96)</li>
<li>Rust (4)</li>
</ul>
<p>Now, the rest of the languages below Python:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>295 occurrences of JavaScript
153 occurrences of LabVIEW
128 occurrences of C
96 occurrences of Kotlin
72 occurrences of Arduino
71 occurrences of C#
69 occurrences of CSS
54 occurrences of PHP
40 occurrences of Shell
34 occurrences of Ruby
16 occurrences of Swift
16 occurrences of Jupyter Notebook
15 occurrences of Scala
12 occurrences of D
12 occurrences of TypeScript
9 occurrences of Dart
8 occurrences of Processing
7 occurrences of CoffeeScript
6 occurrences of Go
6 occurrences of Groovy
6 occurrences of Objective-C
4 occurrences of Rust
3 occurrences of MATLAB
3 occurrences of R
1 occurrences of Visual Basic
1 occurrences of Clojure
1 occurrences of Cuda
</code></pre></div></div>
<p>I have removed markup and shell languages from that list because most of them are probably auto-generated.</p>
<p>In terms of GitHub account names, 133 teams follow FRC convention and use a username starting with <code class="highlighter-rouge">frc</code> followed by their team number; 95 teams use <code class="highlighter-rouge">team</code> followed by their number; and 264 teams use something else.</p>
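<p>One way such a breakdown could be computed is sketched below. The matching rules are assumptions (a prefix followed by at least one digit, compared case-insensitively), and the usernames are made up for illustration.</p>

```python
import re
from collections import Counter


def classify_username(username: str) -> str:
    """Label a username as 'frc', 'team', or 'other' by its prefix."""
    lowered = username.lower()  # assumption: matching ignores case
    if re.match(r"frc\d+", lowered):
        return "frc"
    if re.match(r"team\d+", lowered):
        return "team"
    return "other"


# Made-up usernames, not real team accounts.
sample_usernames = ["frc1234", "Team5678", "robo-builders", "FRC42"]

convention_counts = Counter(classify_username(name) for name in sample_usernames)
print(convention_counts)
```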
<h2 id="using-the-script">Using the script</h2>
<p>This script is not on PyPI this time. You can obtain a copy from my GitHub repo: <a href="https://github.com/Ewpratten/frc-code-stats">https://github.com/Ewpratten/frc-code-stats</a></p>
<p>First, make sure both <code class="highlighter-rouge">python3.7</code> and <code class="highlighter-rouge">python3-pip</code> are installed on your computer. Next, delete the <code class="highlighter-rouge">data.json</code> file. Then, install the requirements with <code class="highlighter-rouge">pip3 install -r requirements.txt</code>. Finally, run <code class="highlighter-rouge">python3 main.py</code> to start the script. Now, go outside and enjoy nature for about an hour, and your data should be loaded!</p>
</p>
</div>
</section>
</div>
<!-- Footer -->
<footer id="footer">
<div class="inner">
<ul class="icons">
<li><a href="https://twitter.com/ewpratten" class="icon alt fa-twitter" target="_blank"><span class="label">Twitter</span></a></li>
<li><a href="https://gitlab.com/u/ewpratten" class="icon alt fa-gitlab" target="_blank"><span class="label">GitLab</span></a></li>
<li><a href="https://github.com/ewpratten" class="icon alt fa-github" target="_blank"><span class="label">GitHub</span></a></li>
<li><a href="/feed.xml" class="icon alt fa-rss" target="_blank"><span class="label">RSS</span></a></li>
</ul>
<ul class="copyright">
<li>&copy; Evan Pratten retrylife</li>
<li>Design: <a href="https://html5up.net" target="_blank">HTML5 UP</a></li>
</ul>
</div>
</footer>
</div>
<!-- Scripts -->
<script src="http://0.0.0.0:4000/assets/js/jquery.min.js"></script>
<script src="http://0.0.0.0:4000/assets/js/jquery.scrolly.min.js"></script>
<script src="http://0.0.0.0:4000/assets/js/jquery.scrollex.min.js"></script>
<script src="http://0.0.0.0:4000/assets/js/skel.min.js"></script>
<script src="http://0.0.0.0:4000/assets/js/util.js"></script>
<!--[if lte IE 8]><script src="http://0.0.0.0:4000/assets/js/ie/respond.min.js"></script><![endif]-->
<script src="http://0.0.0.0:4000/assets/js/main.js"></script>
<!-- Global site tag (gtag.js) - Google Analytics -->
<script async src="https://www.googletagmanager.com/gtag/js?id=UA-74118570-2"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag() {
dataLayer.push(arguments);
}
gtag('js', new Date());
gtag('config', 'UA-74118570-2');
</script>
</body>
</html>

View File

@ -102,12 +102,12 @@
<ul>
<li><h3><a href="/frc/2019/07/06/ScrapingFRCGithub.html" class="link" title="2019-07-06 11:08:00 -0400">Scraping FRC team's GitHub accounts to gather large amounts of data</a></h3></li>
<li><h3><a href="/projects/2019/07/01/devDNS.html" class="link" title="2019-07-01 18:13:00 -0400">devDNS</a></h3></li>
<li><h3><a href="/projects/2019/06/27/PWNlink.html" class="link" title="2019-06-27 13:16:00 -0400">I had some fun with a router</a></h3></li>
<li><h3><a href="/random/2019/06/27/Python.html" class="link" title="2019-06-27 03:00:00 -0400">Hunting snakes with a shotgun</a></h3></li>
</ul>
<a href="all_posts.html" class="button next">View All</a>