route-recognizer

So you just wrote your first single page application. And hopefully it had real URLs in it. But how did the application get from a URL to knowing which code to execute? And how did that running code know how to generate URLs? The answer is “serialization” and “deserialization.”

This is a seemingly simple concept but there are a lot of things to think about in handling it. As a result multiple libraries across numerous languages and frameworks all exist to help solve it. The focus here is on the approach taken by the next generation version of route-recognizer but the underlying concepts apply to all tools like this.

This is a long leisurely walk through the topic which makes no assumptions about what you know. The TL;DR is the Solution section but it won’t make much sense unless you’re already intimately familiar with the data structures and algorithms involved. Let’s get started.

All we need is a bit of data.

Since all we’re talking about is ways to store and retrieve data, let’s explore what the simplest possible approach to this problem might be. Realistically there’s nothing to stop us from taking an object like this:

And serializing it into a URL that looks like this:

https://www.example.com/%5B%7B%22name%22:%22application%22%7D,%7B%22name%22:%22users%22%7D,%7B%22name%22:%22users.user%22,%22params%22:%7B%22userID%22:%22nathanhammond%22%7D%7D%5D

In fact that would make it incredibly easy and performant to serialize and deserialize!

JSON.stringify is a serialization function that comes built in to JavaScript. We can use it to reduce an in-memory data structure to a string which we can pass around, save for later, and, eventually, deserialize. JSON.parse is a function which will deserialize a string of data which has been serialized and turn it back into an in-memory data structure identical to the original.

So that’s kind of awesome: our language already comes with everything we need to store information in a URL! It is wholly unambiguous, it’s really easy to process, all asset paths are guaranteed to be correctly relative since there is not a single / in the URL, and generally life is good. Ship it!

Except for one thing: on the web your URL is part of the user interface for the people using your application.

We need data and a user interface.

The URL that we really want doesn’t look anything like the mess that we have above:

https://www.example.com/users/nathanhammond

It’s not intimidating. A frequent visitor might begin to recognize patterns and then use those patterns to be able to quickly jump to a page. It becomes the Dewey Decimal System of your site. Having been trained for two decades to expect this format people instinctively know how to use URLs. They’re literally what makes the web, the web.

For bonus points, if you investigate this simplistic serialized state in the above diminutive example URL and compare it to the state object from way up top it appears to have most of the key information present. You could almost come up with a set of rules to describe what happens here:

  • Assume that everything gets the {'name': 'application'} item first.
  • Take the first bit of the path and add it to the array: {'name': 'users'}.
  • Take the last bit of the path and…

Hrm. There’s not enough information in just the URL anymore to be able to generate the in-memory data-structure corresponding to that URL. How can we deserialize this diminutive URL in a lossless manner to the state object above? We need for this to be symmetric, so that means that some of the information has to be stored somewhere else–in application code.

What if we could create a lookup of some sort and have the serializer/deserializer refer to that as we’re generating the data?

We need data, a user interface, custom serialization, and custom deserialization.

So if we’re going to have a lookup of some sort we need to know when to refer to the lookup. We’re now squarely in the space of building something that looks and acts like a parser. One way to write a parser is to define a formal grammar which describes what the content will look like using Backus-Naur Form. This works really well for parsing something that is in a specific format that doesn’t have any particular meaning to the parser like an address. To finish this approach you would take the grammar and feed it into a parser generator such as JISON which spits out a bit of code, your parser, that you can run against any text using that grammar. That outputted parser is then used to build up an in-memory data structure representation of the input content which you can use to understand and process it. All of that works very well for text and languages that map to a formal grammar, but URLs are anything but a formal grammar, eliminating this approach from contention.

Another way to generate a parser like the one we’re going to need is to use a different programming language to build up the in-memory representation of state that we’re looking for. This is the creation of a domain-specific language. If the audience for what you need to parse is able to write code it makes this approach an obvious winner. Inside of route-recognizer this is the approach that we take.

Here is route-recognizer’s DSL:

If you squint a little bit it starts to look like a URL. If you’re familiar with Ember you’ll notice that this looks a lot like what you end up writing inside of your application’s router.js file, This is because Ember is delegating its routing behavior to route-recognizer.

Note that the path property of the DSL for users.user leverages a microsyntax to specify that it maps to a dynamic value. route-recognizer supports three types of nodes:

  • Static segments. These are strings which directly appear in the URL.
  • Dynamic segments. Segments preceded by : are considered dynamic segments. These match values until the next / with any content at all.
  • Globbing nodes. These nodes are prefixed with * and match any content at all–even when they encounter a /.

This neatly solves “grammar of the URL” function of having a formal grammar but says nothing about how to serialize or deserialize state. Looking back at the example code using the route-recognizer DSL we could generate an in-memory-object that looks something like this:

Serialization

The to method on the above nodes in the DSL code example specifies a unique identifier for a particular node in the routing structure. That means that if we wanted to serialize state we could do something as simple as:

It gets a bit more complicated if we start to handle dynamic value insertion, but it’s still pretty straightforward:

So clearly serialization isn’t going to be a problem. But we already know from our attempt to come up with the manual algorithm to deserialize /users/nathanhammond that it’s going to be much more difficult to deserialize.

Deserialization

There are multiple approaches we can use to handle deserialization.

Regular Expressions

The first and most-obvious solution is to make everything a regular expression. Looking back to our microsyntax we have three node types:

  • For static segments we can map the value as the regular expression: /value/.
  • For dynamic segments the regular expression is /([^/]+)/.
  • For globbing “segments” the regular expression is /(.+)/.

We could then build a regular expression for each node inside of lookup, iterate over them, and compare them against the URL until we find a match.

Iterating over options without trying to prune which options to test seems like a pretty poor strategy. It is, however, incredibly straightforward to implement and has the interesting property in many higher-level languages of offloading a bunch of computation to the language underlying the runtime for possibly huge performance wins.

Radix (Prefix) Tries

The routes we’ve defined also look remarkably like a tree. It seems like we could leverage that shape to prune branches of the tree to speed up the processing. So let’s start with this tree:

Note that we’ve now rooted the route tree somewhere outside of what was specified in the DSL example. This is so we have an entry point. A radix trie is a tree typically used to do dictionary-style lookup on strings because it allows you to easily index into a set of information. Its nodes define what the value of a string is at any particular point. Looking at our route tree from above you could construct a radix trie that looks like this:

And then, because of the data structure, it is really performant to index into the trie to find the nodes that you care about:

Incredibly, we don’t even have to do a traversal of this trie, just a few property checks. But where did the nodes with ID 2 and 6 go? Dynamic segments need to be handled differently and aren’t as gracefully supported with a pure radix trie, but with a mild revision you can support them:

Taking this approach and implementation we’ve added a few constraints into the system. It happened with such subtlety that you may not have even noticed it happening. For example, in this solution there’s no way to store routes with paths that contain multiple segments:

And the inverse of that problem, seemingly a regression from the regular expression approach, routes must now also match at segment boundaries only:

We’ve also made it unclear how to support the sneaky behavior of routes that don’t consume a segment:

Or even how to support glob segments at all:

The existing data structure inside of route-recognizer, a nondeterministic finite automata (NFA), actually has an answer for each one of these problems.

Nondeterministic Finite Automata

A NFA is actually a data structure on which a regular expression engine can be built. Considering that the first tool we reached for in trying to recognize a route was indeed a regular expression, on its face this seems like a good fit.

The way a NFA works is to consume one character at a time and compare that to a state graph which it has built up. Adding just a single route of /users/new into this state graph would be the equivalent of turning it into a linked list:

Let’s add a few more:

It looks like we’ve made it possible to match on non-segment boundaries, as well as the problem of paths with multiple segments. We can also easily add in dynamic and glob segments by pointing the node back at itself and using a character class:

The comparison function for an NFA is known as the transition function. It receives as input a set of current states (in the above example a path which started with /u would be in three states) as well as the next character in the string. The transition function returns a new set of states which are fed into another iteration of the transition function as the current set of states until the input is consumed or the transition function returns an empty set.

NFAs also support a technique known as an epsilon (ε) move where you move to a different node without consuming a character from the input. This allows us to support nodes that don’t have a value in the URL string but are still present in the route tree.

Looking at the above diagram we’ve created a huge number of objects from a very small input. That’s the primary problem with NFAs: they can easily grow to a tremendous size. Setting aside for a moment the dynamic and glob segments, we’ve added what appear to be “duplicate” nodes into this state machine for us and the entire posts subtree. One way you can optimize that is through compression and taking advantage of a property known as an “accepting state,” indicated below with parentheses:

By collapsing states that are duplicated this allows us to dramatically reduce a NFA’s generated state machine size, similarly to the way a radix trie works. If we needed only true or false matching the above state machine works perfectly.

But we actually need more information than that: we need to know which route matched. This means we need to have rules around what we can collapse. In the above diagram we don’t know whether we ended on /users or /posts as those are conflated into a single accepting state. It turns out that we can’t actually compress this state machine very well:

Only the /us route is able to be compressed by collapsing nodes and leveraging accepting states. So, we’ve solved all of the problems from the radix trie approach, but we’ve incurred a significant size penalty.


So now we’ve examined how we’re deserializing data from a URL into an in-memory understanding that we are able to use. But we’ve still not solved all of the problems:

Moving from the giant serialized JSON object into this reduced state isn’t 100% lossless and has created some ambiguity. It’s possible to have a path like: /users/new. Looking at the above route definition is that a user whose userID is new? Or is it the users.new route which allows you to create a new account? How should we handle ambiguous URLs?

We need data, a user interface, custom serialization, custom deserialization, and rules for resolving ambiguous URLs.

Our goal is to be able to make it easy to define and understand how your route structure is built and parsed. There should exist no surprises when you define your routes and attempt to visit them. This means we need to have rules which result in an unambiguous outcome, every time.

Given our desire to make this as generic of a tool as possible we could try supporting matching at non-segment boundaries. But that leaves us in this somewhat literal mess:

In this admittedly slightly contrived example there’s no particularly obvious way to assume the intent of the user. You’ll recall that it was a seeming regression to be unable to match at arbitrary character boundaries, but that violates the mental model people have of URLs and makes them nearly impossible to define. As a result that will be our first constraint.

Constraint #1: All route segments must begin and end with the boundary character, /, or alternatively, at the end of the path.

Next ambiguous example:

This is a slightly more controversial opinion, but deciding between these three routes which are all matches requires us to have guidelines for consistent behavior. In Rails the approach they take for recognition is a depth-first-search with early return semantics. This means that you must sort your routes into their priority manually.

The resolution of a dynamic segment or a glob node as priority over a static segment approximately never maps to the desired outcome. A claim such as that requires extraordinary evidence, which is supported by the fact that this is the default behavior in every single Ember application in existence. One other problem with definition order being the deciding factor is that it creates a refactoring hazard which is easy to miss in your testing.

In this case we’ll define specificity such that static segments are matched first, dynamic segments second, glob nodes third

Constraint #2: Match based upon specificity as opposed to definition order.

If we’re going to support an approach that allows for specificity analysis that means we’re unable to leverage early return as effectively as Rails. Take for example this route tree:

We must effectively winnow the NFA solution set by affirmatively eliminating every single node as a possibility. We can’t avoid work by simply returning early otherwise we get the wrong answer. We can still eliminate entire subgraphs such as other in this example, but once reaching a node in the state machine we must enumerate all of its possible next states.

Constraint #3: The deserializer must affirmatively eliminate every single node in the state machine.

Constraints

In summary, here are the constraints we have defined for our system.

  1. All route segments must begin and end with the boundary character, /, or alternatively, at the end of the path.
  2. Match based upon specificity as opposed to definition order.
  3. The deserializer must affirmatively eliminate every single node in the state machine.

These constraints address the 90th percentile case, but we need to be more specific. How do we handle specificity for routes which define multiple segments?

We need data, a user interface, custom serialization, custom deserialization, rules for resolving ambiguous URLs, and support for arbitrarily complex path segments.

Turns out that our proposed set of constraints are missing a few definitions. How do you handle something that matches two different segment patterns?

The only way to differentiate between these two routes is by the order of their segment types. Therefore, in a revision to constraint #2, we should order matches based upon their specificity.

Constraint #2.1: Segments are ranked in order of their segment specificity. Scores are '3' for static, '2' for dynamic, '1' for glob, and '' (empty string) for epsilon segments.

We concatenate the scores from left to right to come up with our final score. After that, though, we can still have collisions if we have epsilon segments:

It seems clear that we should select the one which matches more segments.

Constraint #2.2: Segment count, including epsilon segments, is the second tiebreaker.

After settling segments the next thing we should take into account is having multiple ways to achieve the same radix trie. Consider this route map:

Which of those two routes should we choose? In route-recognizer parlance each of those match functions optionally has a handler. The only way to differentiate between these two routes is to count the number of handlers for that particular route, which seems like it implies greater specificity.

Constraint #2.3: If the segment score and count are identical, then the number of handlers is the third tiebreaker.

And lastly, if you were to define two identical routes:

Constraint #2.4: If the segment score, segment count, and handler count are all identical the definition order is used as the final tiebreaker: first one wins.

These are the specificity patterns currently specified by route-recognizer.

We need data, a user interface, custom serialization, custom deserialization, rules for resolving ambiguous URLs, support for arbitrarily complex path segments, and it must be fast.

The current implementation of route-recognizer uses a typical character-at-a-time NFA. This generates an in-memory representation which serializes to approximately 60 megabytes of JSON for a reasonably sized application with 500 routes.

It isn’t possible to precompute a parser with a backing data structure that large so instead it ships the grammar (the route DSL) as well as the parser generator ( route-recognizer itself) so that it can avoid shipping the resulting parser. The file size of the grammar and parser generator weighs in at about 5 kilobytes gzipped. It then generates the parser at runtime during application boot. On weaker devices this can take up to a second to generate. Further compounding the problem is the fact that it remains dormant in memory for the entire lifetime of the application creating unnecessary memory pressure which results in additional garbage collection events.

This is all clearly suboptimal and in a post-1.0 world we would like to address some of these problems. We should reduce the size of the backing data structure to the point where it can be precomputed and shipped as part of the application code. We should also make it possible to ship just the parser and not the original grammar from the DSL and the parser generation code.

Solution

Toward the end of all of the research the correct solution started to become clear: a hybrid radix trie and NFA.

  • Use the basic architecture of a radix trie as the backing data structure.
    • Each route segment becomes a state in the NFA.
  • Consume whole-segments at a time using a NFA transition function to walk the radix tree and generate solution sets.
  • For special node types take advantage of transition function implementation details which will guarantee termination as a consequence of segment consumption:
    • Epsilon segments will be inserted as a separate node class, you simply traverse to the end of them, adding each to the solution set as you encounter it.
    • Glob segments will be inserted into the segment trie as a child of itself but maintain the original parent node reference.
    • Children of optional segments will be inserted into the trie as a child of the optional segment and as a child of the optional segment’s parent.
  • Specificity priority is modeled after CSS selector priority where the list is weighted in favor of the first-most category, and then by ordering within that category.

In adopting a hybrid radix trie/NFA approach we reduce the number of states in the system tremendously. Early analysis shows that for the same 500 route grammar the backing data structure is about 100 kilobytes uncompressed and 15 kilobytes gzipped for a net increase (after removing the DSL and parser generation code) of about 10 kilobytes. This smaller data structure reduces memory pressure and will improve the overall performance characteristics of applications built using route-recognizer. On the same low-powered device time-to-first-render can be improved by 500ms or more with garbage collection reduction gains realized throughout the rest of the application lifetime.

We could also explore additional runtime optimizations by precomputing the depth of a node in the trie as well as its specificity at build time. Micro-optimizations such as these will likely have tradeoffs much sooner with any increase in file size as in most applications deserialization only happens once.


An early implementation of this approach can be found in my route-recognizer fork which has not yet been updated to take full advantage of this research.

Now we’ve figured out how to serialize data into a URL and deserialize data from a URL. But we’re still not doing anything useful with it.

We’ve managed to accomplish a bunch of neat things with this approach, but just because we can store all of this data in a pretty URL-looking format doesn’t mean that we have finished solving the problem. We’ve merely managed to address the problems with storing and retrieving data from a URL! Now we need a way to build applications that consume this data and do something useful with it!

The next post in this series on routing will talk about how router.js uses this information to build a routing library.

References