Recovering a blog… for dummies

Part 1 – Dissolution

In 2023 I let my domain registration and hosting services lapse. I had owned jonmcosgrove.name since about 2013, and babylonman.com for even longer, perhaps since about 2010. Somewhere in there I also registered joncosgrove.com. At one point or another, each of these domains had its DNS pointing to my WordPress blog hosted at Bluehost.

For several years my activity was low but consistent. Then a few years ago posting consistency declined. The trend is clear in the post history.

A variety of things contributed to this, but the notable drivers were:

- 02/2019 – Jane – less time
- 08/2021 – Max – even less time

Finally, after about 10 years of operating a WordPress blog of one kind or another, I let the hosting and domain registrations lapse in early 2023. I was confident in my backups and even remember running my FTP sync from the server a few months before the lapse of services.

Part 2 – Rebirth

This past winter / spring my interest was rekindled. Partly this was related to having a bit more headspace with the kids getting older. Partly it was inspired by trips and major events. So I decided to kickstart the old blog again, but then a long-hidden disaster was revealed.

The crucial piece of any WordPress blog, the database, was not actually preserved. My local backups were corrupted, somehow. My automated backup service, set up a decade ago, was pointing to a long out-of-date storage provider. And my FTP sync, while it did preserve media and configuration files, did not include the actual post text or other database content.

This realization was slow to materialize. I was sure that I had a good, recent backup. How could I not? But source by source, while attempting to revive and recover my blog amid renewed interest, I slowly realized my posts were gone.

And so we have this post, about content recovery and the process to restore my 47 posts of rambling, which hold a special place in my internet heart.

Part 3 – Archaeology

Ok, I need the post content. I have the media files, images and video, albeit in a disorganized manner. The exact layout and photo libraries are not so important to me. So with this in mind I set out to find the original post content.

I’ve long known about the Wayback Machine, or more formally web.archive.org. I even knew it had touched my site in the past, as I’ve stumbled across some of the saved snapshots over the years. This proved to be a valuable resource for restoring some posts, but it didn’t have every post. The snapshots include text, image positions, and associated image file names via links, though not the actual images.

I’ve also been aware of large scale web crawling projects, mainly due to their association with recent AI projects. I was confidently hopeful that one of these would have hit upon my blog and at least preserved the text.

Unfortunately, after some digging, I watched with disappointment as my queries to Common Crawl’s URL tables returned nothing for any of my past domains. What I thought was a sure bet turned up nothing.

Lastly, I still hadn’t turned over every local storage stone that might be hiding some post. Searching some old disused Macs and a few external drives I refuse to throw away turned up some archives of the sites, but nothing fruitful. I believe I must have had something misconfigured in my backup plugin, as all of the database files seem to be unusable.

At this point I was getting pretty down, having only recovered a handful of posts. But then one morning I came upon the thought of iPhone backups. I used to use the WordPress app, and upon restoring a few of those backups to my phone I found a treasure trove! About half of my posts’ text was preserved in the app data, hooray!

Part 4 – Reconstruction

The reconstruction portion was straightforward, if a bit tedious. I captured the posts and their original post dates and quickly generated new entries on a fresh AWS instance. Preserving the post dates was important to me, and it took a bit of time to review each one. The photos took a while as well. Some of the recovered text contained modified links (in the case of web.archive.org) or references to varying domains. After a few relaxed days over Thanksgiving break, though, I managed to get everything I had up to date photo-wise. I stayed true to the originals as far as I can tell, thanks to good backups of the media library content and its date-sorted directory structure, which allowed me to associate images and posts chronologically.
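For the modified links, a small cleanup pass can help. The sketch below assumes the archive prefix always looks like `https://web.archive.org/web/<timestamp>/`, optionally with a modifier such as `im_` after the timestamp, as in the samples further down. The name `unwrap_wayback` is just for illustration.

```python
import re

# Wayback prefix: a numeric timestamp, optionally followed by a modifier
# such as "im_" (seen on archived image URLs), then a slash.
WAYBACK_PREFIX = re.compile(r"https?://web\.archive\.org/web/\d+[a-z_]*/")


def unwrap_wayback(text: str) -> str:
    """Remove every Wayback Machine prefix, leaving the original URLs."""
    return WAYBACK_PREFIX.sub("", text)


if __name__ == "__main__":
    wrapped = (
        "https://web.archive.org/web/20201127102644/"
        "https://www.jonmcosgrove.name/s-is-for-san-francisco-bay-area-ca/img_2558/"
    )
    print(unwrap_wayback(wrapped))
```

The result still points at whatever domain the post used at the time, so a second pass may be needed to swap in the current one.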

One area of inconsistency is the featured image. For the most part I had no record of these. A few I remember clearly but for many I’m just guessing.

Another disappointment is that not all posts have been recovered. In several cases I have titles, but no content. Where this is the case I’ve created a new category for the blog: Vanished without a trace. I hope to reduce this list over time but as of this writing there are 10 orphaned posts:

List of the forgotten:

For each of these I’ve generated a tribute, an ASCII art scene, unique to the post in lieu of the content.

And now a few notes on the process:

How to restore text:

Method 1: via web.archive.org

1. see https://web.archive.org/web/*/jonmcosgrove.name*
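The same listing can be pulled programmatically instead of by browsing. This is a rough sketch that assumes the Wayback Machine's CDX search endpoint and its `matchType`, `fl`, `filter` and `collapse` parameters. The helper names are made up for illustration, and the network fetch itself is left out.

```python
import json
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(domain: str, limit: int = 500) -> str:
    """Build a CDX query for every successfully captured URL on a domain."""
    params = {
        "url": domain,
        "matchType": "domain",       # include subdomains such as www.
        "output": "json",
        "fl": "timestamp,original,statuscode",
        "filter": "statuscode:200",  # skip redirects and errors
        "collapse": "urlkey",        # one row per distinct URL
        "limit": str(limit),
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def snapshot_urls(cdx_json: str) -> list[str]:
    """Turn a CDX JSON response (header row first) into snapshot links."""
    rows = json.loads(cdx_json)
    if not rows:
        return []
    header, *records = rows
    ts, orig = header.index("timestamp"), header.index("original")
    return [f"https://web.archive.org/web/{r[ts]}/{r[orig]}" for r in records]
```

Fetching `cdx_query_url("jonmcosgrove.name")` with any HTTP client and passing the response body to `snapshot_urls` should give one link per archived page, which makes it easier to see which posts are covered and which are not.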

Method 2: via iPhone backups

1. I used to have the WordPress app on my phone
2. iPhone backups include apps and their data
3. post body text from 2014 – Jan 2018 is contained in backups, alas no pictures
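An alternative to restoring a backup onto a phone is to look inside the backup folder on the computer. The sketch below rests on several assumptions that should be checked against a real backup: that the backup is unencrypted, that it contains a `Manifest.db` SQLite file with a `Files` table (`fileID`, `domain`, `relativePath`, `flags`), that `flags = 1` marks regular files, that files sit in a subfolder named after the first two characters of `fileID`, and that the app's domain is `AppDomain-` plus its bundle ID. The function name `app_files` is hypothetical.

```python
import sqlite3
from pathlib import Path


def app_files(backup_dir: Path, bundle_id: str) -> list[tuple[str, Path]]:
    """List (path inside the app, path on disk) for one app's backed-up files."""
    con = sqlite3.connect(backup_dir / "Manifest.db")
    try:
        rows = con.execute(
            "SELECT fileID, relativePath FROM Files "
            "WHERE domain = ? AND flags = 1 "  # assumption: 1 means regular file
            "ORDER BY relativePath",
            (f"AppDomain-{bundle_id}",),
        ).fetchall()
    finally:
        con.close()
    # Assumption: each file is stored as <backup>/<first 2 chars>/<fileID>.
    return [(rel, backup_dir / file_id[:2] / file_id) for file_id, rel in rows]
```

With a listing like this, any database file the app kept can be copied out and opened with `sqlite3` to look for post text. The bundle ID to pass in (for example `org.wordpress`) is also an assumption to verify.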

Method 3: via commoncrawl.org

1. I was very hopeful on this… commoncrawl is a project to scrape most of the web
2. data is used for LLM training
3. unfortunately a query of URLs present in the dataset returned nothing for:
    1. jonmcosgrove.name
    2. jonmcosgrove.com
    3. babylonman.com
4. BUST!
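For anyone wanting to repeat the check, here is a rough sketch of one way to do it. It uses Common Crawl's public URL index API rather than the URL tables, so it is a swapped-in technique and not necessarily the query that was run for this post. The crawl ID is a placeholder that would come from the list of crawls Common Crawl publishes, and the helper names are invented for illustration.

```python
import json
from urllib.parse import urlencode


def cc_index_url(crawl_id: str, domain: str) -> str:
    """Build an index lookup for every URL under a domain in one crawl."""
    params = {"url": f"{domain}/*", "output": "json"}
    return f"https://index.commoncrawl.org/{crawl_id}-index?{urlencode(params)}"


def parse_captures(body: str) -> list[dict]:
    """Parse a newline-delimited JSON response into one dict per capture."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


if __name__ == "__main__":
    # "CC-MAIN-2024-10" is only an example crawl ID; each crawl is queried separately.
    for domain in ("jonmcosgrove.name", "jonmcosgrove.com", "babylonman.com"):
        print(cc_index_url("CC-MAIN-2024-10", domain))
```

An empty list from `parse_captures` for every crawl is the "BUST!" outcome described above.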

Method 4: via email

1. My doting fans! Of course, they sign up for emails!
2. Just ask them to search based on post title and forward the content.
3. BUST! Because I use the “read more” tag on most posts, even in the best case only part of the content is available.

How to restore images:

Method 1: via web.archive.org

Copy code for all images including href IMG_XXXX – sample of one:

<img class="alignnone size-medium wp-image-683" src="http://www.cosgrove.blog/wp-content/uploads/2020/01/IMG_2558-300x225.jpeg" alt="" width="300" height="225" />
<!-- wp:image {"linkDestination":"custom"} -->
<figure class="wp-block-image"><a href="https://web.archive.org/web/20201127102644/https://www.jonmcosgrove.name/s-is-for-san-francisco-bay-area-ca/img_2558/"><img src="https://web.archive.org/web/20201127102644im_/https://i0.wp.com/www.jonmcosgrove.name/blog/wp-content/uploads/2020/01/IMG_2558.jpeg?resize=150%2C150&amp;ssl=1" alt="" /></a></figure>
<!-- /wp:image -->

Parse with regex to only get file names:
```
img_2558
```
Search computer for file
Match from a few results to one based on date range
Upload to site and create gallery based on IDs
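
The parse, search and match steps above can be sketched roughly as follows. This assumes camera-style names of the form `IMG_1234` and uses each file's modification time as a stand-in for the photo date, which will not hold for files that were copied or edited later. The function names are hypothetical.

```python
import re
from datetime import datetime
from pathlib import Path

# Matches IMG_2558 in any case; a WordPress size suffix such as -300x225
# falls outside the captured group.
IMG_NAME = re.compile(r"(IMG_\d+)", re.IGNORECASE)


def image_stems(html: str) -> list[str]:
    """Unique image names, upper-cased, in order of first appearance."""
    seen: dict[str, None] = {}
    for match in IMG_NAME.finditer(html):
        seen.setdefault(match.group(1).upper(), None)
    return list(seen)


def find_candidates(root: Path, stem: str, start: datetime, end: datetime) -> list[Path]:
    """Files under root named like stem and modified between start and end."""
    hits = []
    for path in root.rglob("*"):
        if not path.is_file() or path.stem.upper() != stem.upper():
            continue  # also skips resized variants such as IMG_2558-300x225
        modified = datetime.fromtimestamp(path.stat().st_mtime)
        if start <= modified <= end:
            hits.append(path)
    return sorted(hits)
```

Running `image_stems` over the copied markup and then `find_candidates` with a window around the post date narrows a few same-named files down to the likely original.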

Method 2: via ftp backups

1. Based on post date review wp_uploads directory in past backups
2. Using post context select and position images
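
Because the uploads folder is sorted by date (the sample URL above lives under `uploads/2020/01/`), the review step can be narrowed with a few lines of code. This is a minimal sketch assuming a `<year>/<month>` layout under a local copy of the uploads folder. The function name and the list of image extensions are illustrative choices.

```python
import re
from datetime import date
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif"}
# WordPress-generated sizes end in -<width>x<height>, e.g. IMG_2558-300x225.
RESIZED = re.compile(r"-\d+x\d+$")


def uploads_for_post(uploads_root: Path, post_date: date) -> list[Path]:
    """Original images uploaded in the same month as the post."""
    month_dir = uploads_root / f"{post_date.year:04d}" / f"{post_date.month:02d}"
    if not month_dir.is_dir():
        return []
    return sorted(
        p
        for p in month_dir.iterdir()
        if p.suffix.lower() in IMAGE_SUFFIXES and not RESIZED.search(p.stem)
    )
```

Images uploaded in the month before a post went live would be missed, so checking the neighbouring month folder is worthwhile when a post comes up short.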

Change log

20241123

- The Visual Display of Quantitative Information

20241124

- Chile video – easy, YouTube embed
- The signs of fall, Sunnyvale, CA – easy, set the featured image
- Chile – images, fixed some spelling errors (geez who edited that)
- Big Oak Flat backpacking, Yosemite NP, California – images
- Half Marathon, Santa Cruz, CA – images
- 5:57 PM Baby Bullet, Mountain View, CA – images
- 31st Birthday Weekend, San Francisco, CA – easy, YouTube embed, featured image (probably not the original)
- Honeymoon Video – YouTube embed not working great but still links correctly
- Party at the Ritz – added a few images and dancing video

20241125

- First Anniversary – Palacio Los Gatos – images
- Shadowbrook Restaurant, Capitola, CA – images
- CME, Wilderness Medicine Conference, Tahoe, CA – images
- Honeymoon, Portland to Santa Barbara, 12–25 October 2015 – images
- CA – Christmas Party – images
- Pikes Peak, Pike National Forest, CO – images, fixed footnote references, removed apostrophe from post title, it was correct though in the post see (1)
- Snowy Range, Medicine Bow, Wyoming – images
- Robber’s Roost, Hanksville, UT – images, a few typos
- New York City – images
- Moab – images

20241126

- S is for San Francisco, Bay Area, CA – images, wow, I’ve just realized how bad WordPress’ original spell checker was. This post had many errors, corrected with AI!
- The Visual Display of Quantitative Information – finding lots of issues with formatting of image blocks. Partly, links are messed up due to the wayback crawl, and partly due to WordPress migration over time. Also, wow, spelling!
- 2020, so far – syntax refactor
- 2021: Goals – update links from the wayback machine
- Time off, 11/23/2019 – 01/06/2020 – reviewed only, thought to have not included images
- 3 weeks at the beach, Capitola, CA – images, realized the captions were in the paragraph tags without images, fixed that and will need to review other posts for this, got images even in the right places with captions thanks to wayback
