How to Get Started with Adopting a Federal Dataset
27 Aug 2015
~THIS IS A VERY ROUGH DRAFT of a guide to help people get started with Kin Lane's Adopta.Agency project~
Tools you’ll need:
- A GitHub account
- GitHub’s desktop software for Windows or Mac (optional)
- A text editor for code (I use Sublime Text. There are many options, free and paid. Experiment until you find one you like.)
- Spreadsheet software for cleaning up datasets (Google Docs works just fine)
- Find a government dataset to adopt. You can start at data.gov or you can go directly to an agency or department and find a dataset. (This project is meant to be extensible far beyond the federal government’s open data initiative.)
- Fork the initial Adopta.Agency blueprint on GitHub.
Getting Started with GitHub:
Go to the Adopta.Agency blueprint, and “fork” the repository. This makes a copy of the repository in your account, which allows you to make changes without affecting the original repo. (You can, however, propose changes that you’d like to see incorporated into the original repo. This is called a “pull request.”)
The site for your version of the Adopta.Agency blueprint will live at a URL that looks like this: http://[your username].github.io/adopta-blueprint. You can rename the repository, if you like. Or, as I recommend, you can use this blueprint in turn as your own template for the datasets you plan to work on.
I use GitHub’s Mac app in order to work with the repositories I have hosted on the site. Some folks edit code directly via the GitHub website; some folks use the command line. Choose the method you prefer – my instructions are based on my own processes. There are many ways to do the very same thing.
Using the GitHub Mac app, I have cloned a copy of the Adopta.Agency blueprint to my computer. This means I can now make edits offline. These edits stay “local” until I sync them with the online repository. Having the files on my own computer also means I can open them in the text editor I prefer.
Create a new repo for the data project you want to work on. Click on "settings."
Use the "automatic page generator" feature. It's going to give you lots of options for templates. Hit "continue" and "publish." When you're returned to your GitHub repo, you'll notice it now has two branches: "master" and "gh-pages."
You're going to want to update the files in "gh-pages" in order to change the contents of the site. That's the branch GitHub Pages uses to build your repository's website.
Now, clone this repo to your desktop, but make sure that your clone includes the "gh-pages" branch. Highlight all the files in the Adopta.Agency blueprint folder, and copy them to your new project's folder. Overwrite everything in the latter. Back in your GitHub app, hit the "commit" and "sync" buttons. Now your new project should have the look, template, and content of the original Adopta.Agency blueprint. (You're going to edit that, don't fret.)
So let’s explore what’s inside that blueprint repo…
The blueprint uses Jekyll, which is the static site generator that powers GitHub Pages. Most of the files that you’ll be editing in this repository are in HTML, Markdown, or JSON. If you look at the files and folders, many of them are pretty self-explanatory: the CSS folder contains the CSS for the site; the blog folder is where blog posts chronicling the project’s updates will go; the blog.xml file generates the RSS feed; the data folder contains the data; and so on.
If you look inside the “_layouts” folder, you’ll see three files: default.html, page.html, post.html. These are the layouts. Which one a page uses is determined by a “layout” variable that’s set at the top of each HTML file that is to be displayed. Open up the default.html file, and have a peek at what’s going to appear on most of the pages associated with this repo.
You can make changes to any of these files. You can always roll back your changes if you mess something up. That’s one of the features of GitHub: version control.
The important file for you to first edit is the one called _config.yml. (YML is short for YAML, which originally stood for “Yet Another Markup Language” and now stands for “YAML Ain’t Markup Language,” because programmers think they’re hilarious.) This file controls a lot of what happens in your site – the project title, the names of people in your team, the URL where the project lives, etc.
You’ll notice that some of what’s in that YML file is “commented out” – that is, it’s prefaced with a hash sign (#), which marks a comment: a message for the humans who read the file but not for the computers that do so. (Computers just ignore the comments.)
You’ll also notice that each sub-section starts with a setting like showcase_show or api_show. These settings let you toggle on and off what you want to appear on your index.html page (that is, the page that folks will land on when they go to your project site’s “home”).
If you look at the index.html file, you can wade through the code and see the logic behind it: depending on the “yes” or “no” you’ve put in the YML file, certain things will be “true” and will therefore appear on the landing page.
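To make the toggle idea concrete, here is a rough illustration in Python. It is not how Jekyll works internally (Jekyll uses a full YAML parser and Liquid templates); it only mimics the logic of reading "yes" or "no" settings and deciding what shows up. The sample config text, the project title, and the blog_show setting are made up for the example; only showcase_show and api_show come from the blueprint.

```python
# Illustration only: a tiny stand-in for how toggles in _config.yml are read.
# Jekyll itself uses a real YAML parser and Liquid templates, not this code.

# Hypothetical config text. Only showcase_show and api_show are real
# blueprint settings; the title and blog_show are invented for this sketch.
SAMPLE_CONFIG = """\
# Lines starting with a hash sign are comments for humans.
title: My Adopted Dataset   # hypothetical project title
showcase_show: yes
api_show: no
# blog_show: yes   (commented out, so it is never read)
"""


def read_toggles(config_text):
    """Return a dict of the *_show settings, mapped to True or False."""
    toggles = {}
    for raw_line in config_text.splitlines():
        # Naively drop everything after a hash sign, then trim whitespace.
        # (A real YAML parser is smarter about hash signs inside values.)
        line = raw_line.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        key, value = line.split(":", 1)
        key = key.strip()
        if key.endswith("_show"):
            toggles[key] = value.strip().lower() in ("yes", "true")
    return toggles


def visible_sections(toggles):
    """List the sections that would appear on the landing page."""
    return sorted(name[: -len("_show")] for name, on in toggles.items() if on)


toggles = read_toggles(SAMPLE_CONFIG)
print(toggles)
print(visible_sections(toggles))
```

Flipping api_show to "yes" in the sample text would add "api" to the list of visible sections, which is the same kind of switch the landing page responds to.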
Again, feel free to experiment with any of these files and options. You can always revert to an older version. The original repository has several working examples (such as the farmers’ markets CSV and JSON files) for you to model your own project on. But you aren’t obligated to do it all – that is, you don’t have to have an API. You don’t have to create interactive API documentation. You don’t have to maintain a blog. The blueprint is just an outline. Once you fork the project, it’s yours to push forward.
Getting Started with Cleaning Data:
So here are the main ideas that drive the Adopta.Agency project:
- Identify Data: Come up with ideas for Adopta projects, targeting specific data and topics for improvement.
- Improve Data: Acquire, clean up, organize, convert, and publish the data as simple CSV and JSON files.
- Share Data: Publish the data to GitHub as publicly available JSON and CSV files, and if you can, an API as well.
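If you're comfortable with a little scripting, the "clean up" part of the Improve Data step might look something like this sketch in Python. It assumes the mess is the common kind: stray spaces, inconsistent header names, and blank rows. The sample data, column names, and function names are made up for illustration; your dataset will have its own quirks.

```python
import csv
import io

# Hypothetical messy export; the column names and rows are invented.
MESSY_CSV = """\
 Market Name ,City , State
 Downtown Market , Springfield,IL
,,
Riverside Market,  Shelbyville ,IL
"""


def clean_rows(csv_text):
    """Trim whitespace, normalize headers, and drop completely empty rows."""
    reader = csv.reader(io.StringIO(csv_text))
    # Turn " Market Name " into "market_name" so headers are consistent.
    header = [h.strip().lower().replace(" ", "_") for h in next(reader)]
    cleaned = []
    for row in reader:
        values = [cell.strip() for cell in row]
        if not any(values):
            continue  # skip rows with no data at all
        cleaned.append(dict(zip(header, values)))
    return header, cleaned


def to_csv(header, rows):
    """Write the cleaned rows back out as CSV text."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=header, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


header, rows = clean_rows(MESSY_CSV)
print(to_csv(header, rows))
```

The same steps can be done by hand in a spreadsheet; a script just makes them repeatable when the agency publishes an updated file.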
That improve piece is key, but tough. A lot of the datasets you’ll find are not in open formats. They’re in PDFs or Excel files, for example. And even when they are in open formats (TXT, CSV, JSON files), often the data is a mess. (Here’s one useful tool that lets you convert a CSV to JSON that'll get you started. You can convert an Excel file or Google Spreadsheet to CSV, and go from there.) More tips on cleaning up data and what to do when the government has released “open data” in some ridiculously proprietary format coming soon…
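If you'd rather not rely on an online converter, the CSV-to-JSON step can also be done with a few lines of Python's standard library. This is a minimal sketch, assuming the CSV has a header row and is already reasonably clean; the function names, sample data, and file names are placeholders.

```python
import csv
import io
import json


def csv_to_json(csv_text):
    """Convert CSV text (with a header row) to a JSON array of objects."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return json.dumps(list(reader), indent=2)


def convert_file(csv_path, json_path):
    """Read a CSV file and write the converted JSON next to it."""
    with open(csv_path, newline="", encoding="utf-8") as source:
        converted = csv_to_json(source.read())
    with open(json_path, "w", encoding="utf-8") as target:
        target.write(converted)


# Hypothetical sample data, just to show the shape of the result.
SAMPLE = "market_name,city,state\nDowntown Market,Springfield,IL\n"
print(csv_to_json(SAMPLE))
```

Each row becomes one JSON object keyed by the column headers. Note that every value comes out as a string, so numbers and dates would need their own conversion step. To convert a file on disk, you would call something like convert_file("markets.csv", "markets.json"), where both file names are placeholders.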