Introduction to Git


Git is a version control tool used in software development. It tracks changes to a codebase and allows multiple people to work on it in parallel. Most data teams use Git as part of their regular workflow. This lesson focuses on GitHub, a commonly used web platform for hosting Git repositories.

Getting Started with GitHub

  • Install Git locally as a command-line tool
  • Create a GitHub account (the free version is enough for most purposes)
  • Connect your device to your account

Git / GitHub Components

    Repository (Repo)

    A repository (repo) is a directory hosted on GitHub with a log history detailing any content changes. Each log entry is called a commit, which is made up of the content changes, a user-written message summarizing these changes, and a unique identifier called a SHA.


    Say that I find a typo in the linear regression lesson on this website. I correct the typo, add (stage) the change, write a commit message (for instance, "correcting typo in linear regression") and push the commit to our GitHub repo.

    The repo log is updated with a new entry that has SHA 6y7laec5c372b366f1c4e1f0a55947c718a81a9 (an illustrative ID; a real SHA is a hash computed from the commit's contents, not a random value), the message "correcting typo in linear regression" and the updates I made in linear_regression.pug.
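The commit described above can be reproduced in a throwaway repository. This is a minimal sketch, not the website's actual setup: the temporary directory, the placeholder identity, and the file contents are all assumptions made for illustration. With Git's default settings the SHA printed at the end is 40 hexadecimal characters.

```shell
set -eu

# Throwaway repository in a temporary directory (hypothetical project)
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"        # placeholder identity
git config user.email "author@example.com"

# Hypothetical lesson file that contains a typo
echo "Linear regresion fits a line to data." > linear_regression.pug
git add linear_regression.pug
git commit -q -m "adding linear regression lesson"

# Correct the typo, stage the change, and commit it with a message
echo "Linear regression fits a line to data." > linear_regression.pug
git add linear_regression.pug
git commit -q -m "correcting typo in linear regression"

# Each log entry lists the SHA, author, date, and message
git --no-pager log

# SHA and message of the most recent commit
sha=$(git rev-parse HEAD)
msg=$(git log -1 --format=%s)
echo "$sha $msg"
```

The SHA will differ on every run because it depends on the commit's contents, author, and timestamp.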

    Local vs Remote

    Let's use this website as an example again. The codebase is hosted on GitHub, but also on each contributor's device. The copies on our devices are the local repositories, while the one on GitHub is the remote repository.

    If I make a change in my local repository, the website is not affected. I have to push the change from local to remote in order to update the actual website.

    Remote repositories are referenced by name. The default name is origin, but it can be customized.
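The local/remote split can be tried without a GitHub account. In the sketch below a bare repository on disk stands in for the remote hosted on GitHub; that stand-in, the directory names, and the file are assumptions for illustration only.

```shell
set -eu
work=$(mktemp -d)
cd "$work"

# A bare repository plays the role of the remote hosted on GitHub (assumption)
git init -q --bare remote.git

# Cloning creates a local repository whose remote is named "origin"
git clone -q remote.git local
cd local
git config user.name "Example Author"        # placeholder identity
git config user.email "author@example.com"
git remote -v

# Commit a change locally
echo "Hello" > index.pug
git add index.pug
git commit -q -m "adding index page"
branch=$(git rev-parse --abbrev-ref HEAD)

# The commit exists only locally: the remote has no branches yet
before=$(git ls-remote --heads origin | wc -l | tr -d ' ')
echo "branches on remote before push: $before"

# Pushing copies the commit to the remote
git push -q origin "$branch"

local_sha=$(git rev-parse HEAD)
remote_sha=$(git ls-remote --heads origin "$branch" | cut -f1)
echo "local:  $local_sha"
echo "remote: $remote_sha"
```

Until the push runs, nothing outside the local directory changes, which is why local edits do not affect the deployed website.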

    Feature and Master Branch

    Branches allow multiple people to work on the website in parallel. Each branch should contain a separate feature. The master branch (the default branch created alongside the repo; newer GitHub repositories name it main) should be protected and updated only after code changes have been thoroughly reviewed and tested. The code in our master branch is what gets deployed, so any change made to master is reflected on the website. "Master" is only a naming convention; depending on the setup, you can sync any branch with the end product.

    If I want to work on a new feature for the website (for instance, a new lesson), I take the current master branch and create a new feature branch from it (essentially a copy). I make the necessary changes, then open a formal proposal (a Pull Request) to merge these changes into master.
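Branching off master can be sketched locally as follows. The branch name new-lesson and the file names are hypothetical, and the initial branch is explicitly named master so the example behaves the same regardless of local Git defaults.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
# Name the initial branch "master" regardless of the local default
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Author"        # placeholder identity
git config user.email "author@example.com"

echo "Lesson index" > index.pug
git add index.pug
git commit -q -m "adding lesson index"

# Create a feature branch from the current master and switch to it
git checkout -q -b new-lesson

# Work on the feature branch
echo "Draft lesson" > new_lesson.pug
git add new_lesson.pug
git commit -q -m "adding draft of new lesson"

# The new commit exists on the feature branch only
master_commits=$(git rev-list --count master)
feature_commits=$(git rev-list --count new-lesson)
echo "master: $master_commits commit(s), new-lesson: $feature_commits commit(s)"

# Switching back to master removes the feature file from the working directory
git checkout -q master
ls
```

Because master is untouched until a merge happens, work on the feature branch cannot break what is deployed.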

    Pull Requests / Code Reviews

    Each pull request has a unique identifying number, a description of what was changed (and why the changes were necessary), and a visual diff of the changes. Merging the pull request folds the changes into the master branch.

    In most cases, at least one approval from a fellow collaborator is required to merge the pull request. Reviewers can approve the pull request from the UI, request changes or leave comments.

    Reviewers can also pull the changes from the feature branch to view them on their local device.
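Pull requests themselves live in the GitHub UI, but the review steps have rough command-line counterparts. The sketch below assumes a bare repository standing in for GitHub, two hypothetical clones named author and reviewer, and a hypothetical branch new-lesson. The final local merge is only an analogue of merging a pull request, not what GitHub runs.

```shell
set -eu
work=$(mktemp -d)
cd "$work"

# Bare repository standing in for GitHub (assumption); its default branch is master
git init -q --bare remote.git
git --git-dir=remote.git symbolic-ref HEAD refs/heads/master

# Author: publish master, then a feature branch
git clone -q remote.git author
cd author
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Author"
git config user.email "author@example.com"
echo "Lesson index" > index.pug
git add index.pug
git commit -q -m "adding lesson index"
git push -q origin master
git checkout -q -b new-lesson
echo "Draft lesson" > new_lesson.pug
git add new_lesson.pug
git commit -q -m "adding draft of new lesson"
git push -q origin new-lesson
cd ..

# Reviewer: get the feature branch onto their own device
git clone -q remote.git reviewer
cd reviewer
git config user.name "Example Reviewer"
git config user.email "reviewer@example.com"
git fetch -q origin
git checkout -q new-lesson      # creates a local branch tracking origin/new-lesson

# Files the feature branch changes relative to master (what the PR diff shows)
changed=$(git diff --name-only master...new-lesson)
echo "changed files: $changed"

# Local analogue of merging the pull request into master
git checkout -q master
git merge -q --no-ff -m "merging new-lesson into master" new-lesson
```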

    You can also refer to the official GitHub glossary for any Git-related jargon.

    Git commands

    For a more comprehensive list of Git commands, visit the official Git cheat sheet.

    Check Status of Local Repository

    This shows which files were changed.

    # list changed, staged, and untracked files
    git status
    # show unstaged line-by-line changes for all tracked files
    git diff
    # show unstaged line-by-line changes for a specific file
    git diff [file name]
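A short run in a throwaway repository shows what these commands report. The file names and contents are hypothetical; `--short` is a standard `git status` flag that prints one line per file.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"        # placeholder identity
git config user.email "author@example.com"

printf 'Linear regresion\n' > linear_regression.pug
git add linear_regression.pug
git commit -q -m "adding linear regression lesson"

# Modify the tracked file and create a new, untracked one
printf 'Linear regression\n' > linear_regression.pug
printf 'Notes\n' > notes.txt

# One line per file: " M" means modified but not staged, "??" means untracked
status=$(git status --short)
echo "$status"

# Line-level changes for one file; untracked files never appear in git diff
git --no-pager diff linear_regression.pug
```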

    Stage Changed Files to be Committed

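    # stage specific files
    git add [name_of_files]
    # stage all changes in the current directory and below, including new files
    git add .
    # stage changes to tracked (previously added) files only; new files are skipped
    git add -u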
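The difference between `git add -u` and `git add .` is easiest to see with one modified file and one new file. This is a small sketch in a throwaway repository with hypothetical file names.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"        # placeholder identity
git config user.email "author@example.com"

echo "v1" > tracked.txt
git add tracked.txt
git commit -q -m "adding tracked file"

echo "v2" > tracked.txt        # change a tracked file
echo "new" > untracked.txt     # create a file Git has never seen

# -u stages the modification but leaves the new file untracked
git add -u
after_u=$(git status --short)
echo "$after_u"

# "." also stages the new file
git add .
after_dot=$(git status --short)
echo "$after_dot"
```

In the short status format, a letter in the first column ("M", "A") means the change is staged, while "??" marks a file that is still untracked.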

    Commit Staged Changes

    git commit -m "Enter your message here"

    Reset (Undo) Committed Changes

    # undo commits made after the given commit, keeping their changes as unstaged edits
    git reset [commit SHA]
    # undo commits made after the given commit and discard their changes entirely
    git reset --hard [commit SHA]
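The two reset modes can be compared side by side in a throwaway repository. The file name and contents below are hypothetical; the point is what happens to the working copy of the file in each case.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Author"        # placeholder identity
git config user.email "author@example.com"

echo "one" > notes.txt
git add notes.txt
git commit -q -m "first version"
first=$(git rev-parse HEAD)

echo "two" > notes.txt
git commit -q -a -m "second version"

# Default reset: the branch moves back to the first commit,
# but the file keeps its newer contents as an unstaged change
git reset -q "$first"
after_default=$(cat notes.txt)
git status --short

# Commit again, then hard reset: the history and the file both roll back
git commit -q -a -m "second version again"
git reset -q --hard "$first"
after_hard=$(cat notes.txt)

echo "after default reset: $after_default"
echo "after hard reset:    $after_hard"
```

Since `--hard` throws away uncommitted work with no confirmation, it is worth running `git status` first to see what would be lost.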

    Sync Changes

    # fetch changes from the remote named origin and merge them into the current branch
    git pull origin [branch name]
    # push local commits to the remote named origin
    git push origin [branch name]
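A push/pull round trip between two contributors can be simulated on one machine. In this sketch a bare repository stands in for GitHub, and the clones named author and collaborator are hypothetical; the same commands apply when origin points at a real GitHub repository.

```shell
set -eu
work=$(mktemp -d)
cd "$work"

# Bare repository standing in for GitHub (assumption); its default branch is master
git init -q --bare remote.git
git --git-dir=remote.git symbolic-ref HEAD refs/heads/master

# First contributor publishes the initial commit
git clone -q remote.git author
cd author
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Author"
git config user.email "author@example.com"
echo "Lesson index" > index.pug
git add index.pug
git commit -q -m "adding lesson index"
git push -q origin master
cd ..

# Second contributor clones the repository at this point
git clone -q remote.git collaborator

# First contributor pushes another commit
cd author
echo "More content" >> index.pug
git commit -q -a -m "expanding lesson index"
git push -q origin master
author_sha=$(git rev-parse HEAD)
cd ..

# Second contributor pulls to catch up with the remote
cd collaborator
git pull -q origin master
collaborator_sha=$(git rev-parse HEAD)
echo "author:       $author_sha"
echo "collaborator: $collaborator_sha"
```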

    Switch Branches

    # create a new local branch and switch to it
    git checkout -b [name_of_branch]
    # switch to an existing local branch
    git checkout [name_of_branch]

    Thanks for reading our Git lesson

    Here is some additional reading that may be helpful:

  • For a more comprehensive list of Git commands, visit the official Git cheat sheet
