Effortlessly Sync Your Google Colab or Jupyter Notebook Data Projects with GitHub: A Step-by-Step Guide
In the world of data science and analytics, tools like Jupyter Notebook and Google Colab are our best friends for coding and analysis. However, we often overlook one crucial aspect: documentation. We’ve all been there: looking back at our code after a few months, scratching our heads, trying to decipher what we wrote, or struggling to explain it to others. Documentation, often treated as an afterthought, is critical for understanding and collaboration. Here’s where GitHub, a ‘library’ for code, becomes a game-changer, allowing us to store, track, and collaborate on our coding projects effectively.
This article is for readers who are new to integrating version control with their coding workflows. It focuses on the basics of pushing code from Colab to GitHub.
Why GitHub though?
GitHub is a go-to tool for many developers, offering a familiar platform for teams to collaborate. It’s highly probable that your company already utilizes GitHub.
Understanding the Git Workflow:
The Git workflow can be summarized in the following stages:
Working Directory:
- Your local workspace, typically on your laptop, where code changes are made.
- New files here start out “untracked”: Git does not record changes made in the working directory until you tell it to.
- Untracked and modified files can be listed with the command
git status
Staging Area:
- This is where Git starts tracking your changes: by staging a file, you tell Git which changes to include in the next commit.
- Running
git add <filename>
moves files from the working directory to the staging area.
- However, any further changes made to these files after adding them to the staging area won’t be staged unless you add these new changes again.
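- Staged changes are stored on disk inside the repository’s .git folder, so they are still in the staging area when you close and reopen the project. On Colab this holds only while the runtime is alive; once the runtime is recycled, the cloned folder and anything you have not pushed are gone.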
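To see the first two stages in action without touching a real project, here is a minimal Python sketch that runs Git through subprocess in a temporary folder. The git helper, the throwaway repository, and the file contents are illustrative assumptions of mine; the two-letter codes come from git status --porcelain, the short machine-readable form of git status. It assumes Git is installed and on the PATH.

```python
import subprocess
import tempfile
from pathlib import Path


def git(*args, cwd):
    """Run a git command inside `cwd` and return its standard output."""
    done = subprocess.run(
        ["git", *args], cwd=cwd, check=True, capture_output=True, text=True
    )
    return done.stdout


# Throwaway repository so the experiment never touches a real project.
repo = Path(tempfile.mkdtemp())
git("init", cwd=repo)

# 1. A brand-new file exists only in the working directory ("??" = untracked).
script = repo / "my_list.py"
script.write_text("numbers = [1, 2, 3]\n")
untracked = git("status", "--porcelain", cwd=repo)

# 2. After `git add`, the file is in the staging area ("A " = added).
git("add", "my_list.py", cwd=repo)
staged = git("status", "--porcelain", cwd=repo)

# 3. Editing the file again does not update the staged copy
#    ("AM" = added to staging, then modified in the working directory).
script.write_text("numbers = [1, 2, 3, 4]\n")
modified = git("status", "--porcelain", cwd=repo)

print(untracked, staged, modified, sep="")
```

The three printed lines start with ??, A and AM respectively. Running git add my_list.py once more would stage the newer edit and bring the status back to A.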
Local Repository:
- Once you’re satisfied with the changes in the staging area, you commit them to your local repository.
- This is done using the command
git commit -m "description of changes"
- This commit acts as a checkpoint, capturing the state of your project at that moment.
- You can create multiple commits in your local repository before deciding to send them to the remote repository.
Remote Repository:
- The remote repository is the copy of your repository (i.e., the project folder) hosted on GitHub.
- After you’ve committed your changes locally, you push them to the remote repository using the
git push
command.
- This action synchronizes the remote repository with your local one, ensuring that all the changes you’ve committed are saved and reflected on GitHub.
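The last two stages can be tried offline as well. The sketch below is an illustration under stated assumptions: a local bare repository stands in for GitHub, the identity values are placeholders, and the git helper is a hypothetical convenience wrapper around subprocess. It creates two commits locally and then sends both to the stand-in remote with a single push.

```python
import subprocess
import tempfile
from pathlib import Path


def git(*args, cwd):
    """Run a git command inside `cwd` and return its standard output."""
    done = subprocess.run(
        ["git", *args], cwd=cwd, check=True, capture_output=True, text=True
    )
    return done.stdout


workspace = Path(tempfile.mkdtemp())
remote = workspace / "remote.git"  # stand-in for the repository on GitHub
local = workspace / "local"        # stand-in for the folder cloned into Colab

git("init", "--bare", str(remote), cwd=workspace)
git("clone", str(remote), str(local), cwd=workspace)

# Placeholder identity, stored only in this throwaway clone.
git("config", "user.name", "Example User", cwd=local)
git("config", "user.email", "user@example.com", cwd=local)

# Two checkpoints in the local repository; nothing reaches the remote yet.
for version, body in enumerate(["numbers = [1]\n", "numbers = [1, 2]\n"], start=1):
    (local / "my_list.py").write_text(body)
    git("add", "my_list.py", cwd=local)
    git("commit", "-m", f"Version {version} of my_list.py", cwd=local)

# One push sends every commit the remote does not have yet.
git("push", "origin", "HEAD", cwd=local)

# Commit subjects as seen from the remote side, newest first.
remote_messages = git("log", "--all", "--format=%s", cwd=remote).splitlines()
print(remote_messages)
```

With a real GitHub remote, only the URL differs: the clone would point at https://github.com/... instead of a local path, and the push would ask for authentication.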
Step-by-Step Process to Push Google Colab Code to GitHub
Step 1: Create a new repository on GitHub
- Start by setting up a new repository on GitHub. This repository acts as a dedicated space for your project, holding all related code and files. For instance, create a repo named “Test.”
Step 2: Install Git in your Colab notebook
Colab runtimes usually have Git preinstalled, in which case this command simply confirms it.
!apt-get install -y git
Step 3: Clone the GitHub repository to Colab and move into its folder
# Cloning the repo
!git clone https://github.com/<Your-user-name>/<your-repository-name>.git
This clones the repository we created on GitHub in Step 1 into Colab’s local file system. Once the command finishes, the Files section on the left shows a new folder named after your repository; my repository is called Test, so that is the folder I see.
# Change the working directory to your repository
%cd <your-repository-name>
We switch into the repository folder so that the files we create are saved there.
Step 4: Create the Python file
Use %%writefile my_list.py followed by your Python code to create a new file.
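%%writefile my_list.py
# %%writefile must be the first line of the cell; everything below it is saved to my_list.py
import yfinance as yf

# Fetch historical prices for Apple (AAPL)
stock = yf.Ticker("AAPL")
historical_data = stock.history(period="15mo")
# print() is needed because a script, unlike a notebook cell, does not display the last expression
print(historical_data.head(10))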
This file is now in your local “Test” folder but not yet in the GitHub repository.
As described in the workflow above, my_list.py is currently an untracked file, which you can confirm by running !git status.
Step 5: Adding changes to the Staging Area
- Stage your file for commit with
!git add <your_file_name.py>
- To view staged changes, run
!git diff --staged
Step 6: Configure your details and commit the changes to the local repository
- Configure your Git identity (user name and email) in Colab; Git records both in every commit.
- Commit your changes with a descriptive message using
!git commit -m "description of changes"
- Use
!git log
to list your commits (checkpoints) together with their messages.
Now that the changes are committed, !git status reports that there is nothing to commit.
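The identity mentioned in this step is set with git config. The sketch below is one way the step might look, run through Python’s subprocess in a throwaway repository; the name and email are placeholders, and the git helper is a hypothetical wrapper of mine. In Colab the same settings are commonly applied with !git config --global user.name and !git config --global user.email.

```python
import subprocess
import tempfile
from pathlib import Path


def git(*args, cwd):
    """Run a git command inside `cwd` and return its standard output."""
    done = subprocess.run(
        ["git", *args], cwd=cwd, check=True, capture_output=True, text=True
    )
    return done.stdout


repo = Path(tempfile.mkdtemp())
git("init", cwd=repo)

# Placeholder identity: replace with the name and email of your GitHub account.
# Without --global, the setting applies to this repository only.
git("config", "user.name", "Your Name", cwd=repo)
git("config", "user.email", "you@example.com", cwd=repo)

(repo / "my_list.py").write_text("numbers = [1, 2, 3]\n")
git("add", "my_list.py", cwd=repo)
git("commit", "-m", "Add my_list.py", cwd=repo)

# %an = author name, %ae = author email, %s = commit message subject
last_commit = git("log", "-1", "--format=%an <%ae> | %s", cwd=repo).strip()

# Empty porcelain output means there is nothing left to commit.
pending = git("status", "--porcelain", cwd=repo)

print(last_commit)
print("nothing to commit" if pending == "" else pending)
```

The log line shows that the configured name and email end up attached to the commit, which is what GitHub later uses to link the commit to your account.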
Step 7: Create a GitHub personal access token and push the changes to the remote repository
Pushing to the remote repository requires authentication. GitHub no longer accepts account passwords for Git operations over HTTPS, so we use a Personal Access Token (PAT) instead.
To create a Personal Access Token, click your profile picture -> Settings -> Developer settings (at the bottom of the left-hand menu) -> Personal access tokens -> Tokens (classic) -> enter a note, set the expiration, and select the repo scope -> Generate token. The token is now displayed; copy it right away, because GitHub will not show it again.
Note: Don’t share your Personal Access Token with anyone, and remove it from the cell below before you share the notebook.
# Set your remote URL and include the token
!git remote set-url origin https://<token>@github.com/<Your-user-name>/<your-repository-name>.git
# Push the changes to the GitHub repository
!git push
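Typing the token straight into a cell leaves it in the notebook. As a hedged alternative, the sketch below builds the same remote URL in Python so that the token can come from a hidden prompt. The function names build_remote_url and mask, the EXAMPLE_TOKEN value, and the user and repository names are all made up for illustration.

```python
from urllib.parse import quote


def build_remote_url(token, user, repo):
    """Return an HTTPS remote URL with the token in the user-info position."""
    # quote() percent-encodes characters that are not safe inside a URL.
    return f"https://{quote(token, safe='')}@github.com/{user}/{repo}.git"


def mask(url):
    """Hide the token part of the URL so it is safe to print or log."""
    scheme, rest = url.split("://", 1)
    _token, host_and_path = rest.split("@", 1)
    return f"{scheme}://***@{host_and_path}"


# Made-up values for illustration. In a notebook you would instead use
#   from getpass import getpass
#   token = getpass("GitHub token: ")
# so that the token is typed at a hidden prompt and never saved in a cell.
url = build_remote_url("EXAMPLE_TOKEN", "your-user-name", "Test")
print(mask(url))
```

In a Colab cell, the resulting value can then be passed to Git with !git remote set-url origin {url}, since IPython substitutes Python variables written in curly braces into shell commands. The token is still stored in the repository’s .git/config, so treat the cloned folder as sensitive too.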
After this, I changed my code two more times, committed after each change, and pushed both commits together. My repository now shows three commits.
Clicking the commit count opens the list of all the commits you made. Selecting one of them shows the code that was added compared with the previous version, along with the message you gave when committing. This makes it easy to understand what the new code is and what it does.
Other advantages of GitHub:
- You can revert to any of these commits and recover your old code if your new code breaks things.
- If several contributors work on the project, you can require that two or more people approve your changes before they are merged into the main branch, so that others review the code you wrote and check that everything is correct.
I will leave these features for another article.
I have added the file used in this article to the test repository I was using. The link to the test repository is here.
Conclusion:
With this setup, every commit you push from Colab shows up in your GitHub repository. You can track changes, revert to previous versions, and collaborate with others more efficiently. This process not only saves time but also enhances code quality and teamwork.
“Remember, integrating GitHub with Google Colab isn’t just about code storage; it’s about building a sustainable, collaborative, and understandable coding practice.”