
Project scraper

The projects on my site are automatically scraped and formatted at publish time
using the scripts in this directory. Read more about my reasoning below, or
skip to the directory structure.

Why?

Gatsby's source and transformer plugins are powerful, and I used them in the
initial development of this site. I eventually decided that separating my
collection process would be good for flexibility, control, and offline work.

Flexibility

GraphQL's filters and transforms are powerful, and Gatsby's APIs add more
options for how data is fetched, cached, and transformed. However, complicated
or non-standard transforms and sanitization are much easier to write outside of
Gatsby's ecosystem. For instance, the API starts to feel clunky when a specific
content node needs one-off treatment.
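
As a rough illustration (the node fields and values below are made up, not my
actual schema), a one-off fix becomes a plain map over an array once the data
lives in ordinary JS:

```js
// Hypothetical example: cleaning up one scraped node in plain Node.js,
// with no Gatsby APIs involved. Field names and values are placeholders.
const projects = [
	{
		uid: 'github-example-repo',
		title: 'Example&nbsp;Repo',
		url: 'https://example.com/project?utm_source=feed',
	},
];

const cleaned = projects.map(project => ({
	...project,
	// Decode a stray HTML entity that an upstream source refuses to fix.
	title: project.title.replace(/&nbsp;/g, ' '),
	// Strip tracking parameters before the URL is published.
	url: project.url.split('?')[0],
}));

console.log(cleaned);
```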

Control

I've had a good experience with Gatsby but I may decide to migrate my site to
another platform or format someday. Keeping my data entirely separate from
the site's framework makes migrating it as easy as copy/pasting this
directory. It's just a few JS files!

Offline

Gatsby stores requests made through its source plugins in the `.cache`
directory by default. The `.cache` directory is deleted after:

  • `gatsby clean` is called.
  • `package.json` changes, for example a dependency is updated or added.
  • `gatsby-config.js` changes, for example a plugin is added or modified.
  • `gatsby-node.js` changes, for example if a new Node API is invoked.
  • …etc.

I found I was frequently triggering `.cache` wipes during development. At best
this meant I was pinging APIs and Atom feeds more than necessary. At worst, it
made working offline with project data impossible.

Directory structure

Here's how the scraper is organized for now:

scrape-projects.js
	The megafile that replaces Gatsby's source plugins. It pulls project data
	from every online source and saves the results into `_generated/`.
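
Roughly, the flow looks something like the sketch below. This is not the real
script; the endpoint, UID scheme, and field names are simplified stand-ins, and
it assumes Node 18+ for the built-in `fetch`:

```js
// Simplified sketch of scrape-projects.js, not the actual file.
const fs = require('fs/promises');

async function scrapeProjects() {
	// Pull from each online source; the real script hits more endpoints.
	const response = await fetch('https://api.github.com/users/rileyjshaw/repos');
	const repos = await response.json();

	// Near-raw payloads, nested by type, for fields I might want someday.
	const raw = {github: repos};

	// Flattened, standardized nodes with a `type` and a stable UID.
	const formatted = repos.map(repo => ({
		uid: `github-${repo.name}`,
		type: 'github',
		title: repo.name,
		description: repo.description,
		url: repo.html_url,
	}));

	await fs.writeFile(
		'_generated/scraped-projects-raw.json',
		JSON.stringify(raw, null, '\t')
	);
	await fs.writeFile(
		'_generated/scraped-projects-formatted.json',
		JSON.stringify(formatted, null, '\t')
	);
}

scrapeProjects();
```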

_generated/
	Files generated by the `scrape-projects.js` above. DO NOT EDIT THESE FILES
	MANUALLY! They will be overwritten.

	scraped-projects-raw.json
		Not quite the raw response, but pretty close. This file
		contains all the data that I may decide to use someday, but
		haven't yet. Organized by `type` in a nested object.

	scraped-projects-formatted.json
		Standardized into a smaller format that can be smashed together with
		`curation/` data. Flattened into an array with `type` annotations on
		each node, as well as unique, unchanging project IDs (`UID`).
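
To make the difference between the two files concrete, here's a hypothetical
excerpt of each (illustrative fields only, not the exact schema):

```js
// scraped-projects-raw.json: nested by `type`, close to the raw responses.
const raw = {
	github: [{name: 'example-repo', description: 'A demo', stargazers_count: 12}],
	atom: [{title: 'Example post', published: '2020-01-01T00:00:00Z'}],
};

// scraped-projects-formatted.json: one flat array, each node annotated with
// a `type` and a stable `uid`.
const formatted = [
	{uid: 'github-example-repo', type: 'github', title: 'example-repo'},
	{uid: 'atom-example-post', type: 'atom', title: 'Example post'},
];
```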

curation/
	This is where all custom curation and processing go, e.g. tagging content.
	Projects are modified based on their generated UID.

	tweaks.js
		Mainly for one-off changes, e.g. fixing formatting errors from immutable
		online sources. This file can also be used to apply changes to groups of
		projects (a sketch of one possible format follows this `curation/` entry).

	tags.js
		TODO: figure out where `tags`, `lastTagged`, and `coolness` data are
		going to live.
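
I haven't settled on the curation format yet, but a tweaks file could look
something like this (the UIDs, fields, and structure are made up):

```js
// Hypothetical shape for curation/tweaks.js. Nothing here is final.
module.exports = {
	// One-off fixes, keyed by project UID.
	byUid: {
		'github-example-repo': {
			title: 'Example Repo', // Fix casing the source gets wrong.
		},
	},
	// Broader passes applied to every node of a given type.
	byType: {
		atom: node => ({...node, title: node.title.trim()}),
	},
};
```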

sources/
	Offline data files and collections to complement the online data cached in
	`_generated/`.

	standalone-projects.json
		TODO: Move these over from the `src/data` directory.

tools/
	Custom tools to help classify, organize, or edit project nodes without
	opening a text editor. These are only built for data that is too difficult
	to keep updated or standardized manually.
	TODO: Hook these up to a Node server so they edit the JSON files directly.

	tagger.html
		Finds untagged or incorrectly tagged projects, as well as projects
		that were last tagged before a new tag type was added. Provides an
		interface to preview and re-tag each project.

	cool-sort.html
		TODO: sort or insert nodes based on their "coolness".

test/
	Quick test files to ensure data is downloaded without any dropped nodes,
	UIDs are unique, etc.
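
For example, a UID-uniqueness check can stay tiny with Node's built-in
`assert` (the path and `uid` field name below assume the formatted output
described above):

```js
// Sketch of a test/ check: no two formatted projects may share a UID.
const assert = require('assert');
const projects = require('../_generated/scraped-projects-formatted.json');

const uids = projects.map(project => project.uid);
const duplicates = uids.filter((uid, i) => uids.indexOf(uid) !== i);

assert.strictEqual(
	duplicates.length,
	0,
	`Duplicate UIDs found: ${duplicates.join(', ')}`
);
console.log(`All ${uids.length} project UIDs are unique.`);
```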