I’m releasing some code that could get you in some mild trouble if you use it. It’s nothing groundbreaking - just a run-of-the-mill scraper written in Node.js that grabs your data from Yelp and gives it to you either as JSON or as HTML formatted with hReview.
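hReview is a microformat: review fields wrapped in plain HTML with well-known class names, so other tools can parse them back out. A minimal sketch of what that output might look like, generated with plain JavaScript - the class names follow the hReview convention, but the review data and the `toHReview` helper are made up for illustration:

```javascript
// Render one review object as an hReview microformat block.
// The sample review below is illustrative, not real Yelp output.
function toHReview(review) {
  return [
    '<div class="hreview">',
    '  <span class="item"><span class="fn">' + review.business + '</span></span>',
    '  <span class="rating">' + review.rating + '</span> stars,',
    '  reviewed by <span class="reviewer vcard"><span class="fn">' +
      review.author + '</span></span>',
    '  on <abbr class="dtreviewed" title="' + review.date + '">' +
      review.date + '</abbr>',
    '  <blockquote class="description">' + review.text + '</blockquote>',
    '</div>'
  ].join('\n');
}

var html = toHReview({
  business: 'Example Coffee',
  rating: 4,
  author: 'A. Reviewer',
  date: '2012-03-01',
  text: 'Tepid but reliable.'
});
console.log(html);
```

The point of the format is that the same markup doubles as a readable web page and as machine-parseable data.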
Dear Yelp: it only fetches one user’s data - not everyone’s. No need to worry about evil people stealing data; if you really tried to use this tool for that, it’d do a terrible, incomplete job. It’s for users who want their own data.
Yelp has an API. It’s right here, but, in the words of a Yelp employee, it’s made for businesses - it doesn’t have a method to get your data.
The problem is the gap between what Yelp’s lawyers write about technology and what they write about copyright.
As between you and Yelp, you own Your Content. (5C: Content Ownership)
Yelp is reasonable about copyright; like many other services, they claim perpetual rights to use your content pretty much however they like, and the aggregate of your star rating with everyone else’s isn’t “everyone’s” - it’s owned by Yelp. Fair: aggregation is what they do.
But you own the © on your data - what if you want it?
You also agree not to, and will not assist, encourage, or enable others to: use any robot, spider, site search/retrieval application, or other automated device, process or means to access, retrieve, scrape, or index any portion of the Site or any Site Content; (6B, part iii)
You can write your reviews and post them on Yelp, but there is no way - API or scraping - that you can legally copy them from Yelp, except by visiting each page and copy & pasting. For me, this is a deterrent to contributing to Yelp, even if it’s tepid reviews of coffeeshops. And since I’ve found DC’s best, there’s not much to say there.
So I get it: companies see user-generated data as their competitive advantage. If anyone could get a MySQL dump of Yelp, there’d be lots of competitors ‘unfairly’ advantaged by having the work done for them. Yelp has competition - Google Places, Foursquare and the like - and needs to manage how its content gets reused.
But that’s not the point: a website that invites contributions but lacks an export API isn’t good enough for conscientious or creative users. In this case, over-eager legal terms really limit the potential of the site.
The first iteration was in node.io and CoffeeScript, but I rebuilt it with cheerio, a great implementation of jQuery’s essentials on top of a relaxed parser. And instead of request, I used node-get - the library that I wrote, and that still powers TileMill and some other work projects.
This really isn’t a significant amount of code: maybe 50 lines total, and an hour or less to build.
Scrapers rarely work on more than one site, and abstracting the process rarely pays off. That makes them a nice thing to build every once in a while: it helps less-technical people a lot to give them the ability to export data that they own but that is hard to pull out.