EDS 296: Data Science Portfolios

GitHub Tools


November 8th, 2024

Demonstrate by doing


GitHub provides many cool project management features that facilitate organization, collaboration, coding, and building workflows. By using these tools (while working solo and while working with others), you can demonstrate your technical and programming proficiencies.

Four little monsters on grass support another monster starting their climb up a rock face. The climber’s harness is labeled 'Coder', the belayer wears a harness labeled 'Code review', two others consulting a book and route map wear caps labeled 'Documentation' and 'Reuse', and another brings a box labeled 'Project management and snacks'.

Track ideas & TODOs using GitHub issues

Add a new issue from a repo’s Issues tab


“GitHub Issues are items you can create in a repository to plan, discuss and track work. Issues are simple to create and flexible to suit a variety of scenarios. You can use issues to track work, give or receive feedback, collaborate on ideas or tasks, and efficiently communicate with others.”

A few helpful features:

When should I use issues?



Issues are a useful and valuable tool for tracking TODOs, jotting down ideas, recording bugs, etc., regardless of whether you’re working alone or with collaborators.


Like most things, it’s great to put some care and thought into writing an issue (especially when collaborating or contributing thoughts to a public project, e.g. ggplot2)...


...but I’d also argue that a hasty issue on a personal project can still go a long way in helping you remember a helpful resource, or that idea that popped into your mind at a time you couldn’t devote much attention to it.

Organize and prioritize issues (and pull requests) using GitHub projects

Create a project from your profile’s Projects tab


“Projects are an adaptable collection of items that stay up-to-date with GitHub data. Your projects can track issues, pull requests, and ideas that you note down. You can add custom fields and create views for specific purposes.”

A few helpful features:

When should I use projects?



If you use issues, projects may offer an additional helpful way to organize your tasks.


Code is often spread across multiple repositories (capstones & GPs are excellent examples of this!) – projects can be particularly helpful for tracking TODOs and progress across them.


Projects are not required – you can decide if they are a helpful tool for you / your team.

Collaborate with teams across shared projects (repos) using GitHub organizations

Create an organization from your GitHub profile


“Organizations are shared accounts where businesses and open-source projects can collaborate across many projects at once, with sophisticated security and administrative features.”


Create one by clicking the Create new (“+”) button, or by clicking your profile image (top right corner) > Your organizations > New organization. Choose the “free” option.


A few helpful features:

When should I use organizations?



GitHub organizations are extremely helpful when collaborating with teams of people within a company / group. Benefits include:


  • creating a professional image for a team or project with a unique name, brand, and / or website (MEDS Capstone projects should all have an organization e.g. Outdoor Equity (MEDS 2022), PYFOREST (MEDS 2023), CASAschools (MEDS 2024))
  • centralized management of repositories, members, and settings
  • fine-grained access control, where organization owners can assign roles to individual members and add members to teams, providing different levels of control over repository access

Host reports, documents, websites, etc. with GitHub Pages

GitHub Pages can be enabled for any repo


You can host one website or rendered HTML document from any public GitHub repository.

Hosting additional websites via GHP:

  • same process as hosting your personal website (see week 0 materials)
    • NOTE: you’re only allowed one user website with the github.io suffix (e.g. <username>.github.io) – all other URLs will be structured as <username>.github.io/<repoName>

Hosting a rendered HTML document via GHP:

  • e.g. Quarto doc, Quarto presentation, Quarto manuscript…etc.
  • rendered file must be named index.html and live in your repo’s root directory; no other configurations necessary
  • deploy from Settings > Pages > set branch to main and directory to root
  • re-render index.html, then push to GitHub to update deployment
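
The steps above can be sketched from the command line. This is a minimal sketch, not the only way to do it: it assumes the Quarto CLI and git are installed, that the repo already has a remote named origin, and that GitHub Pages is already set to deploy from the root of main. The function name publish_html_doc and the file name report.qmd are hypothetical.

```shell
# Hypothetical helper: render a Quarto document to index.html in the repo
# root, then commit and push it so GitHub Pages can redeploy.
publish_html_doc() {
  qmd_file="$1"   # e.g. report.qmd (hypothetical file name)

  # Render the document; --output sets the name of the rendered file.
  quarto render "$qmd_file" --output index.html || return 1

  # Stage everything that changed (including any supporting files folder
  # that Quarto creates next to index.html), then commit and push.
  git add -A &&
    git commit -m "Re-render $qmd_file" &&
    git push origin HEAD
}
```

You would call it from the repo’s root directory, e.g. publish_html_doc report.qmd, each time the document changes.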


The materials for creating your personal website using Quarto (above) are one example of a published document. Similarly, the slides for customizing your Quarto website are also published using GitHub Pages.

When should I use GitHub Pages?



You can use GitHub Pages to host projects and resources that you want to share publicly with others (e.g. colleagues, clients, potential employers, etc.).


GitHub Pages can be enabled from any public repo owned by a personal profile or organization.


Consider hosting instructional documentation, software user guides, reports, project websites, etc.

Automate workflows with GitHub Actions

What is GitHub Actions?


“GitHub Actions (GHA) is a continuous integration and continuous delivery (CI/CD) platform that allows you to automate your build, test, and deployment pipeline.”


You can use GHA to automate pretty much anything (truly)! But some concrete examples:

  • building and deploying a GitHub Pages site whenever changes are pushed or merged into main
  • adding appropriate labels whenever someone creates a new issue in your repo
  • running tests whenever code is pushed to a repo to ensure new changes don’t break existing code
  • running data analysis pipelines whenever new data is pushed to the repo
  • running linters to ensure code adheres to a particular style guide

Some definitions


  • workflow: a configurable automated process that runs one or more jobs; each workflow is defined by a YAML file in your repo’s .github/workflows/ directory (e.g. automatically build and redeploy a GitHub Pages site)
  • event: a specific activity in a repo that triggers a workflow run (e.g. pushing modified files to GitHub)
  • job: a set of steps in a workflow that is executed on the same runner; steps are executed in order (e.g. check out repo, install Quarto, install R + dependencies, render & publish Quarto project)
  • runner: a server that runs your workflows when they’re triggered
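
To see how these four terms map onto an actual workflow file, here is a minimal sketch that writes a small, annotated workflow into a repo. The helper name write_example_workflow, the file name example.yml, and the job name say-hello are all made up for illustration; this is not the Quarto publishing workflow used later in these slides.

```shell
# Hypothetical helper: write a minimal, annotated workflow file into a repo.
write_example_workflow() {
  repo_dir="${1:-.}"   # defaults to the current directory
  mkdir -p "$repo_dir/.github/workflows"
  cat > "$repo_dir/.github/workflows/example.yml" <<'EOF'
# workflow: this whole file
name: Example workflow

# event: run whenever commits are pushed to main
on:
  push:
    branches: [main]

jobs:
  # job: a named set of steps (the name say-hello is made up)
  say-hello:
    # runner: a GitHub-hosted Ubuntu virtual machine
    runs-on: ubuntu-latest
    # steps run in order on that runner
    steps:
      - name: Check out repository
        uses: actions/checkout@v4
      - name: Print a message
        run: echo "Triggered by a push to main"
EOF
}
```

Once a file like this is committed and pushed to main, each later push to main should show up as a new run in the repository’s Actions tab.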


(right) An example GitHub Actions workflow for building and deploying a Quarto website.

(left) The repository’s Actions tab, where you can monitor the status of your GitHub Actions workflows.

When should I use GitHub Actions?



Consider using GitHub Actions whenever you want to automate tasks like building, testing, and deploying code from your GitHub repository.


Setting up a GHA workflow from scratch can be a bit intimidating, so it’s great to make use of workflow templates, which can be used as-is or modified for your custom workflow. You can sometimes also find templates provided by other tools for automating specific tasks (e.g. Quarto provides templates for executing R or Python code and rendering output to GitHub Pages).


It can be helpful to read a bit more about the YAML syntax used in workflow files before diving into creating or modifying your own workflows.

BONUS: An example GHA workflow for automating Quarto website builds and deployments

Let’s demonstrate with mysite (from week 0)



The next few slides walk through setting up a GitHub Action that automates the building and deployment of a basic Quarto website, which may contain R code (e.g. rendered as part of a blog post). Up until now, I’ve been manually building mysite locally, then pushing the rendered files (in the docs/ folder) to GitHub, where GitHub Pages deploys from.


Rather than building a workflow from scratch, we’ll use a workflow template provided in the Quarto documentation.


We’ll follow these general steps:

  1. Set up a virtual environment for our project
  2. Create a gh-pages branch, where our rendered website files will be stored
  3. Add a GHA workflow to our repository
  4. Reconfigure GitHub Pages to serve our website from our gh-pages branch

NOTE: GHA workflows will (likely) take longer than local builds



Whenever an event triggers a GHA workflow, GitHub spins up a virtual machine (i.e. a runner) where our defined jobs are executed. You can think of a runner as a brand new (mostly blank slate) computer. We’ll make use of a GitHub-hosted runner, though you can host your own runners.


You must provide all necessary pieces of software to this runner (e.g. R, Quarto, any R packages your code uses, your repo’s code, etc.). You do so in your workflow script.


As a result, an automated GHA workflow will take more time to complete than if you were to build your website locally, then push all rendered files (in a docs/ folder) to GitHub for GitHub Pages to deploy. This is, in part, because you already have all necessary pieces of software for rendering your website installed on your local machine.

1. Set up a virtual environment



We’ll want to set up a virtual environment for our project to ensure that our code is reproducible across different machines (including both our local development environment and the GitHub-hosted runner, where our workflow will be executed). Here, we’ll do so using the {renv} package.


Steps:

  1. install {renv} (if necessary)
  2. run renv::init() to initialize renv in our existing Quarto project
  3. say Y when it asks if you want to proceed
  4. install {yaml}, if prompted
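
After renv::init() finishes, the project should contain a lockfile (renv.lock), an .Rprofile, and a renv/ folder with an activate.R script. A quick way to confirm this before committing is sketched below; the helper name check_renv_setup is hypothetical, and the check only looks for the files, it does not validate their contents.

```shell
# Hypothetical helper: confirm that the files renv::init() creates exist.
check_renv_setup() {
  project_dir="${1:-.}"   # defaults to the current directory
  status=0
  for f in renv.lock .Rprofile renv/activate.R; do
    if [ ! -e "$project_dir/$f" ]; then
      echo "missing: $f" >&2
      status=1
    fi
  done
  return "$status"
}
```

Running check_renv_setup from the project root prints any missing file names and returns a non-zero status if something is absent. Remember to commit renv.lock, since that is the file other machines restore packages from.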

2. Create a gh-pages branch



The gh-pages branch is a special branch that you can use to store your built website (i.e. only your website’s rendered files, not the source files (e.g. any .qmds)). We’ll eventually configure GitHub Pages to deploy our website from this gh-pages branch.


Steps:

  1. return to your GitHub repository
  2. click on the branch drop-down button > click View all branches > click the green New branch button > in the New branch name field, type gh-pages > click the green Create new branch button
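
If you prefer the Terminal to the GitHub interface, the same result can be sketched with git. This assumes you are on main, your work is committed, and your remote is named origin; the helper name create_gh_pages_branch is hypothetical. Like the button-based steps above, it starts gh-pages as a copy of the current branch.

```shell
# Hypothetical helper: create gh-pages from the current commit and push it.
create_gh_pages_branch() {
  # Do nothing if the branch already exists locally.
  if git show-ref --verify --quiet refs/heads/gh-pages; then
    echo "gh-pages already exists" >&2
    return 0
  fi

  # Create the branch without switching to it, then publish it to GitHub.
  git branch gh-pages &&
    git push origin gh-pages
}
```

Because git branch does not switch branches, you stay on main afterwards and can keep working as usual.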

3. Add a GHA workflow to the repository


Rather than building a workflow from scratch, we can use the one conveniently provided in the Quarto documentation!


Steps:

  1. return to RStudio
  2. add a folder named .github in your root directory (you can use the Terminal or the New Folder button)
  3. inside .github/ create another folder called workflows
  4. inside workflows/ add a file named publish.yml (you can use the Terminal or New File > Text file button)
  5. copy the workflow content from the Quarto documentation into publish.yml
  6. remove (or comment out) output-dir: docs from _quarto.yml
  7. delete the docs/ folder (we’ll no longer be rendering to / deploying from this directory)
  8. add, commit, and push everything to GitHub
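
The folder and file setup (steps 2 to 4) and the docs/ cleanup (step 7) can also be done from the Terminal. A minimal sketch, with a hypothetical helper name (scaffold_publish_workflow); note that it permanently deletes docs/, and that pasting in the workflow content, editing _quarto.yml, and pushing (steps 5, 6, and 8) are still up to you.

```shell
# Hypothetical helper: create an empty publish.yml and remove docs/.
scaffold_publish_workflow() {
  project_dir="${1:-.}"   # defaults to the current directory

  # Steps 2-4: create .github/workflows/ and an empty publish.yml inside it.
  mkdir -p "$project_dir/.github/workflows"
  touch "$project_dir/.github/workflows/publish.yml"

  # Step 7: delete the old rendered site (this cannot be undone locally).
  rm -rf "$project_dir/docs"
}
```

Run scaffold_publish_workflow from your project’s root directory, then paste the workflow content into the new publish.yml.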

4. Reconfigure GitHub Pages



After pushing your updated files to GitHub, you will probably receive an automated email with the Subject line: [yourUserName/repoName] Run failed: pages build and deployment – this is because GitHub Pages is currently looking to redeploy your site from the docs/ folder (we’ve just removed this) on the main branch.

Our final step is to reconfigure our GitHub Pages to now serve our rendered website from the gh-pages branch.


Steps:

  1. return to your GitHub repository
  2. go to Settings > Pages > switch branch from main to gh-pages and from /docs to /(root) > click Save > check out Actions tab while your site redeploys (remember, this will take a bit longer than you’re used to!)

Try out your GHA!



Test it out!

  1. make a change to your website (this could be as minor as adding some text to one of your pages)
  2. add, commit, push your modified files (NOTE: do not build your website!)
  3. head to your repository’s Actions tab to watch the status of your rebuild / redeployment


Your Action will also be triggered if you merge a pull request into main.

Take a Break

~ Portfolio share outs next! ~
