From 43cbaf7cb4558ba0b9c2f6cf7865c8a46d92f125 Mon Sep 17 00:00:00 2001 From: Evan Pratten Date: Sat, 6 Jul 2019 19:19:02 -0400 Subject: [PATCH] frc-stats --- _posts/2019-07-06-ScrapingFRCGithub.md | 102 ++++++++ _site/all_posts.html | 7 + _site/feed.xml | 157 ++++++++----- _site/frc/2019/07/06/ScrapingFRCGithub.html | 247 ++++++++++++++++++++ _site/index.html | 4 +- 5 files changed, 461 insertions(+), 56 deletions(-) create mode 100644 _posts/2019-07-06-ScrapingFRCGithub.md create mode 100644 _site/frc/2019/07/06/ScrapingFRCGithub.html diff --git a/_posts/2019-07-06-ScrapingFRCGithub.md b/_posts/2019-07-06-ScrapingFRCGithub.md new file mode 100644 index 0000000..53e7568 --- /dev/null +++ b/_posts/2019-07-06-ScrapingFRCGithub.md @@ -0,0 +1,102 @@ +--- +layout: post +title: "Scraping FRC team's GitHub accounts to gather large amounts of data" +description: "There are a lot of teams..." +date: 2019-07-06 15:08:00 +categories: frc +--- + +I was curious about the most used languages for FRC, so I build a Python script to find out what they where. + +## Some basic data +Before we get to the heavy work done by my script, let's start with some general data. + +Thanks to the [TBA API](https://www.thebluealliance.com/apidocs/v3), I know that there are 6917 registered teams. 492 of them have registered at least one account on GitHub. + +## How the script works +The script is split into steps: + - Get a list of every registered team + - Check for a github account attached to every registered team + - If a team has an account, it is added to the dataset + - Load each github profile + - If it is a private account, move on + - Use Regex to find all languages used + - Compile data and sort + +### Getting a list of accounts +This is probably the simplest step in the whole process. I used the auto-generated [tbaapiv3client](https://github.com/TBA-API/tba-api-client-python) python library's `get_teams_keys(key)` function, and kept incrementing `key` until I got an empty array. 
All returned data was then added together into a big list of team keys. + +### Checking for a team's github account +The [TBA API](https://www.thebluealliance.com/apidocs/v3) helpfully provides a `/api/v3/team//social_media` API endpoint that will give the GitHub username for any team you request. (or nothing if they don't use github) + +A `for` loop on this with a list of every team number did the trick for finding accounts. + +### Fetching language info +To remove the need for an Oauth login to use the script, GitHub data is retrieved using standard HTTPS requests instead of AJAX requests to the API. This gets around the tiny rate limit, but takes a bit longer to complete. + +To check for language usage, a simple Regex pattern can be used: `/programmingLanguage"\>(.*)\ + +
  • Scraping FRC team's GitHub accounts to gather large amounts of data

  • + + + + +
  • devDNS

  • diff --git a/_site/feed.xml b/_site/feed.xml index a783ac3..987835c 100644 --- a/_site/feed.xml +++ b/_site/feed.xml @@ -1,4 +1,105 @@ -Jekyll2019-07-01T22:38:02-04:00http://0.0.0.0:4000/feed.xmlEvan PrattenComputer wizard, student, <a href="https://github.com/frc5024">@frc5024</a> programming team lead, and radio enthusiast.devDNS2019-07-01T18:13:00-04:002019-07-01T18:13:00-04:00http://0.0.0.0:4000/projects/2019/07/01/devDNS<p>Over the past year and a half, I have been hacking my way around the undocumented <a href="https://devrant.com">devRant</a> auth/write API. At the request of devRant’s creators, this API must not be documented due to the way logins work on the platform. That is besides the point. I have been working on a little project called <a href="https://devrant.com/collabs/2163502">devDNS</a> over the past few days that uses this undocumented API. Why must I be so bad at writing intros?</p> +Jekyll2019-07-06T19:18:50-04:00http://0.0.0.0:4000/feed.xmlEvan PrattenComputer wizard, student, <a href="https://github.com/frc5024">@frc5024</a> programming team lead, and radio enthusiast.Scraping FRC team’s GitHub accounts to gather large amounts of data2019-07-06T11:08:00-04:002019-07-06T11:08:00-04:00http://0.0.0.0:4000/frc/2019/07/06/ScrapingFRCGithub<p>I was curious about the most used languages for FRC, so I build a Python script to find out what they where.</p> + +<h2 id="some-basic-data">Some basic data</h2> +<p>Before we get to the heavy work done by my script, let’s start with some general data.</p> + +<p>Thanks to the <a href="https://www.thebluealliance.com/apidocs/v3">TBA API</a>, I know that there are 6917 registered teams. 
492 of them have registered at least one account on GitHub.</p> + +<h2 id="how-the-script-works">How the script works</h2> +<p>The script is split into steps:</p> +<ul> + <li>Get a list of every registered team</li> + <li>Check for a github account attached to every registered team + <ul> + <li>If a team has an account, it is added to the dataset</li> + </ul> + </li> + <li>Load each github profile + <ul> + <li>If it is a private account, move on</li> + <li>Use Regex to find all languages used</li> + </ul> + </li> + <li>Compile data and sort</li> +</ul> + +<h3 id="getting-a-list-of-accounts">Getting a list of accounts</h3> +<p>This is probably the simplest step in the whole process. I used the auto-generated <a href="https://github.com/TBA-API/tba-api-client-python">tbaapiv3client</a> python library’s <code class="highlighter-rouge">get_teams_keys(key)</code> function, and kept incrementing <code class="highlighter-rouge">key</code> until I got an empty array. All returned data was then added together into a big list of team keys.</p> + +<h3 id="checking-for-a-teams-github-account">Checking for a team’s github account</h3> +<p>The <a href="https://www.thebluealliance.com/apidocs/v3">TBA API</a> helpfully provides a <code class="highlighter-rouge">/api/v3/team/&lt;number&gt;/social_media</code> API endpoint that will give the GitHub username for any team you request. (or nothing if they don’t use github)</p> + +<p>A <code class="highlighter-rouge">for</code> loop on this with a list of every team number did the trick for finding accounts.</p> + +<h3 id="fetching-language-info">Fetching language info</h3> +<p>To remove the need for an Oauth login to use the script, GitHub data is retrieved using standard HTTPS requests instead of AJAX requests to the API. 
This gets around the tiny rate limit, but takes a bit longer to complete.</p> + +<p>To check for language usage, a simple Regex pattern can be used: <code class="highlighter-rouge">/programmingLanguage"\&gt;(.*)\&lt;/gm</code></p> + +<p>When combined with an <code class="highlighter-rouge">re.findall()</code>, this pattern will return a list of all recent languages used by a team.</p> + +<h3 id="data-saves--backup-solution">Data saves / backup solution</h3> +<p>To deal with the fact that large amounts of data are being requested, and people might want to pause the script, I have created a system to allow for “savestates”.</p> + +<p>On launch of the script, it will check for a <code class="highlighter-rouge">./data.json</code> file. If this does not exist, one will be created. Otherwise, the contents will be read. This file contains both all the saved data, and some counters.</p> + +<p>Each stage of the script contains a counter, and will increment the counter every time a team has been processed. This way, if the script is stopped and restarted, the parsers will just keep working from where they left off. This was very helpful when writing the script as, I needed to stop and start it every time I needed to implement a new feature.</p> + +<p>All parsing data is saved to the json file every time the script completes, or it detects a <code class="highlighter-rouge">SIGKILL</code>.</p> + +<h2 id="what-i-learned">What I learned</h2> +<p>After letting the script run for about an hour, I got a bunch of data from every registered team.</p> + +<p>This data includes every project (both on and offseason) from each team, so teams that build t-shirt cannons using the CTRE HERO, would have C# in their list of languages. Things like that.</p> + +<p>Unsurprisingly, by far the most popular programming language is Java, with 3232 projects. These projects where all mostly, or entirely written in Java. 
Next up, we have C++ with 725 projects, and Python with 468 projects.</p> + +<p>After Java, C++, and Python, we start running in to languages used for dashboards, design, lessons, and offseason projects. Before I get to everything else, here is the usage of the rest of the valid languages for FRC robots:</p> +<ul> + <li>C (128)</li> + <li>LabView (153)</li> + <li>Kotlin (96)</li> + <li>Rust (4)</li> +</ul> + +<p>Now, the rest of the languages below Python:</p> +<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>295 occurrences of JavaScript +153 occurrences of LabVIEW +128 occurrences of C +96 occurrences of Kotlin +72 occurrences of Arduino +71 occurrences of C# +69 occurrences of CSS +54 occurrences of PHP +40 occurrences of Shell +34 occurrences of Ruby +16 occurrences of Swift +16 occurrences of Jupyter Notebook +15 occurrences of Scala +12 occurrences of D +12 occurrences of TypeScript +9 occurrences of Dart +8 occurrences of Processing +7 occurrences of CoffeeScript +6 occurrences of Go +6 occurrences of Groovy +6 occurrences of Objective-C +4 occurrences of Rust +3 occurrences of MATLAB +3 occurrences of R +1 occurrences of Visual Basic +1 occurrences of Clojure +1 occurrences of Cuda +</code></pre></div></div> + +<p>I have removed markup and shell languages from that list because most of them are probably auto-generated.</p> + +<p>In terms of github account names, 133 teams follow FRC convention and use a username starting with <code class="highlighter-rouge">frc</code>, followed by their team number, 95 teams use <code class="highlighter-rouge">team</code> then their number, and 264 teams use something else.</p> + +<h2 id="using-the-script">Using the script</h2> +<p>This script is not on PYPI this time. 
You can obtain a copy from my GitHub repo: <a href="https://github.com/Ewpratten/frc-code-stats">https://github.com/Ewpratten/frc-code-stats</a></p> + +<p>First, make sure both <code class="highlighter-rouge">python3.7</code> and <code class="highlighter-rouge">python3-pip</code> are installed on your computer. Next, delete the <code class="highlighter-rouge">data.json</code> file. Then, install the requirements with <code class="highlighter-rouge">pip3 install -r requirements.txt</code>. Finally, run with <code class="highlighter-rouge">python3 main.py</code> to start the script. Now, go outside and enjoy nature for about an hour, and your data should be loaded!.</p>I was curious about the most used languages for FRC, so I build a Python script to find out what they where.devDNS2019-07-01T18:13:00-04:002019-07-01T18:13:00-04:00http://0.0.0.0:4000/projects/2019/07/01/devDNS<p>Over the past year and a half, I have been hacking my way around the undocumented <a href="https://devrant.com">devRant</a> auth/write API. At the request of devRant’s creators, this API must not be documented due to the way logins work on the platform. That is besides the point. I have been working on a little project called <a href="https://devrant.com/collabs/2163502">devDNS</a> over the past few days that uses this undocumented API. Why must I be so bad at writing intros?</p> <h2 id="what-is-devdns">What is devDNS</h2> <p>devDNS is a devRant bot written in python. It will serve any valid DNS query from any user on the platform. A query is just a comment in one of the following forms:</p> @@ -377,56 +478,4 @@ https://retrylife.ca/feed.xml <audio controls=""> <source src="/assets/audio/SpamPhoneCalls.mp3" type="audio/mpeg" /> Your browser does not support audio players -</audio>I am currently taking a class in school called Music and computers (AMM2M), where as part of the class, whe get together into bands, and produce a song. 
After taking a break from music production for over a year, we have released our song for the class (we do two songs, but the second is not finished yet).Graphing the relation between wheels and awards for FRC2019-06-16T11:51:00-04:002019-06-16T11:51:00-04:00http://0.0.0.0:4000/frc/2019/06/16/Graphing-w2a<p>I was scrolling through reddit the other day, and came across <a href="https://www.reddit.com/r/FRC/comments/byzv5q/i_know_what_im_doing/">this great post</a> by u/<a href="https://www.reddit.com/user/MasterQuacks/">MasterQuacks</a>.</p> - -<p><img src="/assets/images/w2ainspo.jpg" alt="My insporation" /></p> - -<p>I thought to myself “ha. Thats funny”, and moved on. But that thought had stuck with me.</p> - -<p>So here I am, bored on a sunday afternoon, staring at the matplotlib documentation.</p> - -<h2 id="my-creation">My creation</h2> -<p>In only a few lines of python, I have a program that will (badly) graph the number of awards per wheel for any team, or set of teams.</p> - -<p>As always, feel free to tinker with the code. 
This one is not published anywhere, so if you want to share it, I would appreciate a mention.</p> - -<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">requests</span> -<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span> - -<span class="k">class</span> <span class="nc">Team</span><span class="p">:</span> - <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="nb">id</span><span class="p">,</span> <span class="n">wheels</span><span class="p">):</span> - <span class="bp">self</span><span class="o">.</span><span class="nb">id</span> <span class="o">=</span> <span class="nb">id</span> - <span class="bp">self</span><span class="o">.</span><span class="n">wheels</span> <span class="o">=</span> <span class="n">wheels</span> <span class="o">*</span> <span class="mi">2</span> - -<span class="c1">### CONFIG ### -</span> -<span class="n">teams</span> <span class="o">=</span> <span class="p">[</span><span class="n">Team</span><span class="p">(</span><span class="mi">5024</span><span class="p">,</span> <span class="mi">3</span><span class="p">),</span> <span class="n">Team</span><span class="p">(</span><span class="mi">254</span><span class="p">,</span> <span class="mi">4</span><span class="p">),</span> <span class="n">Team</span><span class="p">(</span><span class="mi">1114</span><span class="p">,</span> <span class="mi">3</span><span class="p">),</span> <span class="n">Team</span><span class="p">(</span><span class="mi">5406</span><span class="p">,</span> <span class="mi">3</span><span class="p">),</span> <span class="n">Team</span><span class="p">(</span><span class="mi">2056</span><span class="p">,</span> <span class="mi">4</span><span class="p">)]</span> -<span class="n">year</span> <span class="o">=</span> <span 
class="mi">2019</span> - -<span class="c1">############## -</span> - -<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">team</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">teams</span><span class="p">):</span> - <span class="n">award_data</span> <span class="o">=</span> <span class="n">requests</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s">"https://www.thebluealliance.com/api/v3/team/frc"</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">team</span><span class="o">.</span><span class="nb">id</span><span class="p">)</span> <span class="o">+</span> <span class="s">"/awards/"</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">year</span><span class="p">),</span> <span class="n">params</span><span class="o">=</span><span class="p">{</span><span class="s">"X-TBA-Auth-Key"</span><span class="p">:</span> <span class="s">"mz0VWTNtXTDV8NNOz3dYg9fHOZw8UYek270gynLQ4v9veaaUJEPvJFCZRmte7AUN"</span><span class="p">})</span><span class="o">.</span><span class="n">json</span><span class="p">()</span> - - <span class="n">awards_count</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">award_data</span><span class="p">)</span> - - <span class="n">team</span><span class="o">.</span><span class="n">w2a</span> <span class="o">=</span> <span class="n">awards_count</span> <span class="o">/</span> <span class="n">team</span><span class="o">.</span><span class="n">wheels</span> - <span class="k">print</span><span class="p">(</span><span class="n">team</span><span class="o">.</span><span class="nb">id</span><span class="p">,</span> <span class="n">team</span><span class="o">.</span><span class="n">w2a</span><span class="p">)</span> - - <span class="n">plt</span><span class="o">.</span><span 
class="n">bar</span><span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">team</span><span class="o">.</span><span class="n">w2a</span><span class="p">,</span> <span class="n">tick_label</span><span class="o">=</span><span class="nb">str</span><span class="p">(</span><span class="n">team</span><span class="o">.</span><span class="nb">id</span><span class="p">))</span> - -<span class="c1"># Plot -</span><span class="n">x_lables</span> <span class="o">=</span> <span class="p">[</span><span class="n">team</span><span class="o">.</span><span class="nb">id</span> <span class="k">for</span> <span class="n">team</span> <span class="ow">in</span> <span class="n">teams</span><span class="p">]</span> -<span class="c1"># plt.set_xticklabels(x_lables) -</span> -<span class="k">with</span> <span class="n">plt</span><span class="o">.</span><span class="n">xkcd</span><span class="p">():</span> - <span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Awards per wheel'</span><span class="p">)</span> - <span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span> - -</code></pre></div></div> - -<h2 id="the-result">The result</h2> -<p>Here is the resulting image. From left, to right: 5024, 254, 2224, 5406, 2056</p> - -<p><img src="/assets/images/w2a.png" alt="Thr result" /></p>I was scrolling through reddit the other day, and came across this great post by u/MasterQuacks. \ No newline at end of file +</audio>I am currently taking a class in school called Music and computers (AMM2M), where as part of the class, whe get together into bands, and produce a song. After taking a break from music production for over a year, we have released our song for the class (we do two songs, but the second is not finished yet). 
\ No newline at end of file diff --git a/_site/frc/2019/07/06/ScrapingFRCGithub.html b/_site/frc/2019/07/06/ScrapingFRCGithub.html new file mode 100644 index 0000000..ffd2358 --- /dev/null +++ b/_site/frc/2019/07/06/ScrapingFRCGithub.html @@ -0,0 +1,247 @@ + + + + + Evan Pratten + + + + + + + + + + + + + + + +
    + + + + + + + + + + + +
    + + +
    +
    + +

    I was curious about the most used languages for FRC, so I built a Python script to find out what they were.

    + +

    Some basic data

    +

    Before we get to the heavy work done by my script, let’s start with some general data.

    + +

    Thanks to the TBA API, I know that there are 6917 registered teams. 492 of them have registered at least one account on GitHub.

    + +

    How the script works

    +

    The script is split into steps:

    +
      +
    • Get a list of every registered team
    • +
    • Check for a github account attached to every registered team +
        +
      • If a team has an account, it is added to the dataset
      • +
      +
    • +
    • Load each github profile +
        +
      • If it is a private account, move on
      • +
      • Use Regex to find all languages used
      • +
      +
    • +
    • Compile data and sort
    • +
    + +

    Getting a list of accounts

    +

    This is probably the simplest step in the whole process. I used the auto-generated tbaapiv3client Python library’s get_teams_keys(key) function, and kept incrementing key (the page number) until I got an empty array. All of the returned pages were then joined into one big list of team keys.
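The "keep incrementing until the result is empty" loop can be sketched as follows. This is a minimal illustration, not the script's actual code: `collect_team_keys`, `fetch_page`, and the fake page data are hypothetical names made up for the example. In the real script, `fetch_page` would presumably wrap the library's `get_teams_keys` call.

```python
def collect_team_keys(fetch_page):
    """Request pages 0, 1, 2, ... until one comes back empty."""
    team_keys = []
    page = 0
    while True:
        keys = fetch_page(page)
        if not keys:
            # An empty page means we have walked past the last team.
            break
        team_keys.extend(keys)
        page += 1
    return team_keys


# Stand-in for the real API call, so the sketch runs without a TBA key.
FAKE_PAGES = [["frc1", "frc4"], ["frc254", "frc5024"], []]


def fake_fetch_page(page):
    return FAKE_PAGES[page]


all_keys = collect_team_keys(fake_fetch_page)
```

Passing the fetch function in as a parameter keeps the paging logic separate from the network code, which makes it easy to try out offline.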

    + +

    Checking for a team’s github account

    +

    The TBA API helpfully provides a /api/v3/team/<number>/social_media API endpoint that will give the GitHub username for any team you request (or nothing if they don’t use GitHub).

    + +

    A for loop on this with a list of every team number did the trick for finding accounts.
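Picking the GitHub username out of a social media response might look something like this. It is only a sketch: the `type` and `foreign_key` field names and the `github-profile` value are my assumption about the shape of the TBA response and should be checked against the API docs. `github_username` and the sample data are hypothetical.

```python
def github_username(media_entries):
    """Return the GitHub username from a list of social media entries, or None."""
    for entry in media_entries or []:
        # Assumed field names; verify against the TBA API documentation.
        if entry.get("type") == "github-profile":
            return entry.get("foreign_key")
    return None


# Made-up response for one team with two linked accounts.
sample_response = [
    {"type": "twitter-profile", "foreign_key": "example_team"},
    {"type": "github-profile", "foreign_key": "frc5024"},
]

username = github_username(sample_response)
```

Teams without a GitHub entry simply produce `None`, so the surrounding loop can skip them.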

    + +

    Fetching language info

    +

    To remove the need for an OAuth login to use the script, GitHub data is retrieved using standard HTTPS requests instead of AJAX requests to the API. This gets around the tiny rate limit, but takes a bit longer to complete.
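Fetching a profile page without going through the API could be done roughly like this. This is a sketch using only the standard library. `profile_url` and `fetch_profile_html` are hypothetical helper names, and treating every HTTP error as "move on" is an assumption rather than what the script necessarily does.

```python
import urllib.error
import urllib.request


def profile_url(username):
    """Build the public profile URL for a GitHub username."""
    return f"https://github.com/{username}"


def fetch_profile_html(username, timeout=10):
    """Download the profile page, or return None if it cannot be loaded."""
    try:
        with urllib.request.urlopen(profile_url(username), timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError:
        # Missing or inaccessible account: skip it and move on.
        return None
```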

    + +

    To check for language usage, a simple Regex pattern can be used: /programmingLanguage"\>(.*)\</gm

    + +

    When combined with re.findall(), this pattern will return a list of all recent languages used by a team.
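Here is a small sketch of that pattern in Python. The sample HTML is made up to resemble the markup the pattern targets, and `find_languages` is a hypothetical helper name. Note that the greedy `(.*)` only works cleanly when each language tag sits on its own line, as it does here. A non-greedy `(.*?)` would be a safer choice if several tags could share a line.

```python
import re

# Python version of the pattern from the post (no /gm flags needed with findall).
LANGUAGE_PATTERN = re.compile(r'programmingLanguage">(.*)<')

# Made-up snippet standing in for a downloaded profile page.
sample_html = """
<span itemprop="programmingLanguage">Java</span>
<span itemprop="programmingLanguage">Python</span>
<span itemprop="programmingLanguage">Java</span>
"""


def find_languages(html):
    """Return every language name captured by the pattern, in page order."""
    return LANGUAGE_PATTERN.findall(html)


languages = find_languages(sample_html)
```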

    + +

    Data saves / backup solution

    +

    To deal with the fact that large amounts of data are being requested, and people might want to pause the script, I have created a system to allow for “savestates”.

    + +

    On launch, the script will check for a ./data.json file. If this does not exist, one will be created. Otherwise, the contents will be read. This file contains both the saved data and some counters.

    + +

    Each stage of the script contains a counter, and will increment the counter every time a team has been processed. This way, if the script is stopped and restarted, the parsers will just keep working from where they left off. This was very helpful when writing the script, as I needed to stop and restart it every time I implemented a new feature.

    + +

    All parsed data is saved to the JSON file every time the script completes, or when it detects a catchable termination signal such as SIGINT (a true SIGKILL cannot be trapped by the process).
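The savestate idea could be sketched roughly like this. The file layout (a `counters` dict and a `data` dict) and the names `load_state`, `save_state`, and `process_stage` are assumptions made for the example, not the script's real structure.

```python
import json
import os


def save_state(path, state):
    """Write the whole state (data plus counters) to disk."""
    with open(path, "w") as f:
        json.dump(state, f)


def load_state(path):
    """Read the state file, creating an empty one if it does not exist yet."""
    if not os.path.exists(path):
        state = {"counters": {}, "data": {}}
        save_state(path, state)
        return state
    with open(path) as f:
        return json.load(f)


def process_stage(state, stage, items, handler):
    """Process items for one stage, skipping those already counted."""
    start = state["counters"].get(stage, 0)
    for item in items[start:]:
        state["data"][item] = handler(item)
        # Bump the counter after each item so a restart resumes right here.
        state["counters"][stage] = state["counters"].get(stage, 0) + 1
```

Wrapping the processing loop in `try`/`finally` is one way to make sure `save_state` still runs when the script is interrupted with Ctrl+C.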

    + +

    What I learned

    +

    After letting the script run for about an hour, I got a bunch of data from every registered team.

    + +

    This data includes every project (both on and offseason) from each team, so teams that build t-shirt cannons using the CTRE HERO would have C# in their list of languages. Things like that.

    + +

    Unsurprisingly, by far the most popular programming language is Java, with 3232 projects. These projects were all mostly or entirely written in Java. Next up, we have C++ with 725 projects, and Python with 468 projects.

    + +

    After Java, C++, and Python, we start running into languages used for dashboards, design, lessons, and offseason projects. Before I get to everything else, here is the usage of the rest of the valid languages for FRC robots:

    +
      +
    • C (128)
    • +
    • LabVIEW (153)
    • +
    • Kotlin (96)
    • +
    • Rust (4)
    • +
    + +

    Now, the rest of the languages below Python:

    +
    295 occurrences of JavaScript
    +153 occurrences of LabVIEW
    +128 occurrences of C
    +96 occurrences of Kotlin
    +72 occurrences of Arduino
    +71 occurrences of C#
    +69 occurrences of CSS
    +54 occurrences of PHP
    +40 occurrences of Shell
    +34 occurrences of Ruby
    +16 occurrences of Swift
    +16 occurrences of Jupyter Notebook
    +15 occurrences of Scala
    +12 occurrences of D
    +12 occurrences of TypeScript
    +9 occurrences of Dart
    +8 occurrences of Processing
    +7 occurrences of CoffeeScript
    +6 occurrences of Go
    +6 occurrences of Groovy
    +6 occurrences of Objective-C
    +4 occurrences of Rust
    +3 occurrences of MATLAB
    +3 occurrences of R
    +1 occurrences of Visual Basic
    +1 occurrences of Clojure
    +1 occurrences of Cuda
    +
    + +

    I have removed markup and shell languages from that list because most of them are probably auto-generated.
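A report in the "N occurrences of X" format can be produced from a tally with `collections.Counter`. This is a sketch of the "compile data and sort" step, using the three top counts quoted in the post. `format_report` is a hypothetical name.

```python
from collections import Counter


def format_report(counts):
    """Return one line per language, most used first."""
    return [f"{n} occurrences of {lang}" for lang, n in counts.most_common()]


# The three most used languages, with the counts quoted in the post.
top_languages = Counter({"Python": 468, "Java": 3232, "C++": 725})
report = format_report(top_languages)
```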

    + +

    In terms of GitHub account names, 133 teams follow FRC convention and use a username starting with frc, followed by their team number, 95 teams use team then their number, and 264 teams use something else.
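Sorting usernames into those three buckets could be done with two small regular expressions. This is a sketch under the assumption that matching is case-insensitive and that the number directly follows the prefix. `naming_style` is a hypothetical name, and apart from `frc5024` the usernames are made up.

```python
import re
from collections import Counter


def naming_style(username):
    """Classify a GitHub username as 'frc', 'team', or 'other'."""
    name = username.lower()
    if re.match(r"frc\d+", name):
        return "frc"
    if re.match(r"team\d+", name):
        return "team"
    return "other"


# frc5024 appears in the post; the other names are invented for illustration.
sample_usernames = ["frc5024", "Team254", "example-robotics", "FRC1114"]
style_counts = Counter(naming_style(name) for name in sample_usernames)
```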

    + +

    Using the script

    +

    This script is not on PyPI this time. You can obtain a copy from my GitHub repo: https://github.com/Ewpratten/frc-code-stats

    + +

    First, make sure both python3.7 and python3-pip are installed on your computer. Next, delete the data.json file. Then, install the requirements with pip3 install -r requirements.txt. Finally, run python3 main.py to start the script. Now, go outside and enjoy nature for about an hour, and your data should be loaded!

    +

    +
    +
    + +
    + + + + +
    + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/_site/index.html b/_site/index.html index 705ae7f..0e195de 100644 --- a/_site/index.html +++ b/_site/index.html @@ -102,12 +102,12 @@