Scraping and parsing Goodreads user books data in Node.js

Published on
Published on
/3 mins read/---

Since Goodreads no longer supports fetching user's books data via their API, I've decided to scrape user's book data using the RSS feed and parse it in Node.js.

Goodreads API

The idea here is to use the rss-parser package to parse the RSS feed and extract the book data.

goodreads.ts
import Parser from 'rss-parser'
import type { GoodreadsBook } from '~/types'
 
let parser = new Parser<{ [key: string]: any }, GoodreadsBook>({
  customFields: {
    // Define all the custom fields you want to extract from the RSS feed
    // Here I'm listing all the available fields from the Goodreads RSS feed
    item: [
      'guid',
      'pubDate',
      'title',
      'link',
      'book_id',
      'book_image_url',
      'book_small_image_url',
      'book_medium_image_url',
      'book_large_image_url',
      'book_description',
      'author_name',
      'isbn',
      'user_name',
      'user_rating',
      'user_read_at',
      'user_date_added',
      'user_date_created',
      'user_shelves',
      'user_review',
      'average_rating',
      'book_published',
    ],
  },
})

Then you can fetch the data from the RSS feed using the parser object, and process it as needed.

goodreads.ts
const GOODREADS_RSS_FEED_URL = '<YOUR_GOODREADS_RSS_FEED_URL>'
 
export async function fetchGoodreadsBooks() {
  if (GOODREADS_RSS_FEED_URL) {
    try {
      let data = await parser.parseURL(GOODREADS_RSS_FEED_URL)
      // All the books data will be stored in the `data.items` array
      // Use the parsed data as needed, for example, you can write it to a JSON file:
      writeFileSync(`./json/books.json`, JSON.stringify(data.items))
    } catch (error) {
      console.error(`Error fetching the Goodreads RSS feed: ${error.message}`)
    }
  } else {
    console.log('📚 No Goodreads RSS feed found.')
  }
}

NOTE

You can get a Goodreads user's RSS feed URL by going to their profile and navigating to the bookshelf page and copy the RSS feed URL. This is my bookshelf page for example: https://www.goodreads.com/review/list/179720035

Now that you have the data you might need to prettify them before storing or using in your application since the data is stored in a raw format.

types.ts
let data = await parser.parseURL(/* GOODREADS_RSS_FEED_URL */)
// Loop through the `data.items` array to prettify the data
for (let book of data.items) {
  book.content = book.content.replace(/\n/g, '').replace(/\s\s+/g, ' ') // Remove line breaks
  book.book_description = book.book_description
    .replace(/<[^>]*(>|$)/g, '') // Remove HTML tags
    .replace(/\s\s+/g, ' ') // Replace multiple spaces with a single space
    .replace(/^["|“]|["|“]$/g, '') // Remove leading and trailing quotation marks
    .replace(/\.([a-zA-Z0-9])/g, '. $1') // Add a space after a period
}
// Use the parsed and prettified data as needed...

The GoodreadsBook type is defined here:

types.ts
export type GoodreadsBook = {
  guid: string
  pubDate: string
  title: string
  link: string
  book_id: string
  book_image_url: string
  book_small_image_url: string
  book_medium_image_url: string
  book_large_image_url: string
  book_description: string
  author_name: string
  isbn: string
  user_name: string
  user_rating: string
  user_read_at: string
  user_date_added: string
  user_date_created: string
  user_shelves: string
  user_review: string
  average_rating: number
  book_published: string
  content: string
}

Caveat

The Goodreads RSS feed is not updated instantly if you update your books on Goodreads. You might need to wait for a few hours before you can fetch the latest data.

Happy scraping!