Building a Notion to Markdown tool is annoying actually

hayden

Over the past few weeks, I’ve been working on bringing this blog to fruition. When we initially set out on the journey, Erik had some interesting requirements for it, chief among them that he wanted to be able to write the posts in Notion because of how nice its WYSIWYG editor is. That’s a fair request; Notion is genuinely nice to write in. But there’s some weirdness with it too.

There are a ton of block types in Notion, from the basic stuff like headings and paragraphs to more specific stuff like Callouts.

> [!NOTE]
> Like this callout right here.

What’s more, its API is still sorta in its infancy, and so there’s some kinda unintuitive stuff about it I had to work around.

However, we needed it, and what are programmers if not problem solvers?

Design & Tech Stack

I toyed with the idea of automating the content of the blog so it didn’t need to live in the repo, but then we threw around the possibility of using some cool MDX stuff later down the line, so I knew that we needed to have the documents editable in the repo. That drove me to build the tool as a CLI instead. Since both Erik and I are intimately familiar with Go, I chose it as the language to build the tool in.

So now we know what tool we need to build, and what language to build it— wait, what do you mean Notion doesn’t have a Go SDK?

Damn. Okay. Luckily, this didn’t end up being the worst thing in the world, because there’s a neat little community-built SDK for it, github.com/jomei/notionapi. Well, that solved that issue. Panic averted.

As I was saying: now that we knew what language we were using, as well as the fundamental “what does it need to do?” side of things, the next step was to come up with a design plan for how it would “do the thing”, as the kids say.

Notion’s API is neatly laid-out, actually

Initially, I looked at just taking a list of pages to sync from Notion, before realising that would be annoying to work with, because it means remembering the list of page IDs or storing it somewhere in the repo. Not my cuppa chai. So instead, I ended up deciding on using a Notion Database. It’s indexed, lets us search, supports custom metadata through “properties”, and still has the lovely WYSIWYG editor we like. Perfect fit. So with that in mind, the sync logic was going to look like this:

  • User runs the sync utility, passing it the ID of the database to pull from
  • The database has pages with fields for name, categories, tags, description and author
  • The utility crawls through each page of the database, grabs all the metadata listed above, and keeps it in an object formatted the same way Astro expects to see it
  • On each page, we break down every block (also a paginated API) and translate from Notion’s API representation to Markdown
    • During this, we work out which Markdown element type the block corresponds to
    • We then go through all the rich text chunks of that block and apply any ‘annotations’ the parts of the text have. Annotations in Notion are things like bold, italicised, underscored, etc.
      • Sometimes those blocks have ‘children’, so we repeat the whole thing again with indentation to mirror that in Markdown
    • We then append that formatted block of text to a list of chunks the page should be constructed from
    • If there are images, we download them, hash their byte content, and store them as <hash>.<ext> on disk so they can be referenced in the Markdown body. Notion’s captions are used as alt text on the images.
  • We combine all those blocks with newline characters (\n) into a cohesive Markdown file
  • We then prepend the frontmatter we built from the metadata to the Markdown body and persist it to disk

That’s not too bad, right? This lets us sync in a fairly modular fashion, while making sure the markdown that we generate remains true to the Notion source of the post.

Implementing the Sync

So I built the tool using Cobra, as it was going to make the UX of the CLI fairly nice (not all that important seeing as it’s basically a tool with only one job, but still).

I’ll skip giving you the boilerplate for Cobra, it’s all freely available online, but my approach boiled down to this (pseudocode for brevity). For each page we paginate through, we send the page itself over to a function called SyncPage, which handles the markdown conversion, metadata extraction, and file writing. Keeps our codebase nice and clean.

func SyncPage(page *notionapi.Page) error {
  // Generate the frontmatter data (title, author, categories,
  // tags, description, dates, etc)
  frontmatter := getFrontmatter(page.Properties)

  // Build up the content block set
  var blocks []string
  err := notion.PaginateBlocks(notionapi.BlockID(page.ID), func(block notionapi.Block) error {
    // 0 is the starting indentation level; FormatBlock recurses into children
    md := markdown.FormatBlock(block, 0)
    if len(md) == 0 {
      // no markdown, no problems
      return nil
    }

    blocks = append(blocks, md...)
    return nil
  })
  if err != nil {
    return err
  }

  // Generate the Markdown text with the frontmatter prepended,
  // plus the slugified file name to save it under
  pageData, fileName := buildPage(frontmatter, blocks)

  // Write out the Markdown
  return os.WriteFile(fileName, []byte(pageData), 0o644)
}

The meat of SyncPage is the markdown.FormatBlock(block, 0) call. It reaches out to a function with a long-ass switch statement that takes a block of any type, checks what type of block it is, and then runs a function that turns it from an API response into sweet, sweet Markdown text. Like this function, which handles paragraphs (probably the easiest one to implement):

func Paragraph(block *notionapi.ParagraphBlock) string {
  var parts []string

  for _, text := range block.Paragraph.RichText {
    parts = append(parts, applyAnnotations(text))
  }

  return strings.Join(parts, "")
}

This is where the rich text iteration and annotations I mentioned before come into play. Notion text doesn’t come through like Markdown text; it comes through as an array of text elements in the order Notion stores them, split wherever the formatting changes. This means that for the sentence “Hello, world, I am currently italic.” (with “world” bolded and “italic” italicised), we don’t get Hello, **world**, I am currently _italic_., instead we get this:

{
  "richText": [
    { "text": "Hello, ", "annotations": {} },
    { "text": "world", "annotations": { "bold": true } },
    { "text": ", I am currently ", "annotations": {} },
    { "text": "italic", "annotations": { "italic": true } },
    { "text": ".", "annotations": {} }
  ]
}

So that for loop in the middle of Paragraph (and most of the element functions to be fair) loops through each of these richText elements and applies the annotations, essentially creating this slice:

["Hello, ", "**world**", ", I am currently ", "_italic_", "."]

Which we then concatenate with strings.Join to generate this string:

"Hello, **world**, I am currently _italic_."

Notion’s API is full of these little annoyances that we have to account for, so there’s a lot of checking against various properties of what is realistically just text, generating a list of strings, then smushing ’em all together into one coherent string.

The Result

Once all our processing has gone through correctly, we end up with neat Astro markdown files that look like this:

---
title: "Baby's first blog post"
author: "hbjydev"
pubDate: 2023-02-25
updatedDate: 2023-02-25
categories: ["things", "stuff"]
tags: ["why", "are", "you", "reading", "this", "go", "away"]
---

Hello, **world**, I am currently _italic_.

- I am a
- List
  - With children

> [!NOTE]  
> And I'm a callout from Notion as a GFM-style Alert

They just get stored in src/content/blog/<date>-<title slug>.md. Astro picks them up, Nix builds Astro, GitHub Actions pulls the Nix output, and hey presto! We have a live blog that people can browse to with content pulled from Notion.

I hope this little dive into our new blog tech was interesting, and if there’s enough interest, maybe I’ll put together a follow-up deep dive into the topic and go through the code we actually use in more depth.