5 Degrees of Pokémon Bacon

One of the side projects I work on outside of my job is a Japanese anime site called AniDB. It was put together by some German guys quite some time ago with Perl, PostgreSQL, and some old donated hardware/bandwidth from an ISP. They were kind/foolish enough to let me help out with the site about two years ago. Anidb has almost anything you could possibly need to know about Japanese anime series (much like imdb for movies). Including things like what characters know each other between the different series.

satoshiAniDB’s data is populated by its users.  Unfortunately, not all of them are as careful as others about providing complete entries for the character data they add.  Many of the minor characters get entered but never linked to the rest of the story.  We needed to find these missing/orphaned characters so they can be investigated and cleaned up by the AniDB staff.  The question we were trying to answer was, “out of any character in any Pokémon series, who doesn’t have some sort of relationship to Ash/Satoshi (the main character)?” It’s almost impossible to not SOMEHOW be related to the main character.   Because of how the reporting system is setup, we also wanted this to be a single query so we could just throw it in a report.  AniDB’s tables are setup a little weird. This is in part due to necessary design and in part due to historical reasons that stem from organic growth.  This design is part of the complexity of the problem.

For the first part of our report, we wanted to get a list of EVERY character in any type of Pokemon anime. To do that, the tables we have are:

animetb -- The Anime Table
-------
id -- Primary Key for the anime
...

seqtb -- The Sequel Table
-----
aid -- Foreign Key to animetb. The subject anime.
next -- Foreign Key to animetb. The object anime (sequel)
type -- What kind of relation. Sequel, side-story, OVA, etc.

So to start with, we had the original Pokemon anime that started it all in 1997.  This is anime id 230 in our system.  We need to traverse the tree structure of anime relations to get a list of every anime related to the original Pokemon.  This in itself can be quite complicated.  Here’s a graph of how all these anime are related.  To do this I needed a great feature of PostgreSQL called Recursive Common Table Expressions (CTEs).  To explain how these work, lets go through the (slightly modified) example in the PostgresSQL docs.

WITH RECURSIVE temp(n) AS (
  VALUES (1)
  UNION ALL
  SELECT n+1 FROM temp WHERE n < 10
)
SELECT sum(n) FROM temp;

The way this works is, we create a named recursive CTE called “temp”.  Recursive CTEs contain several parts.  The non-recursive term: VALUES(1), which unsurprisingly just returns the result set of “1”.   This is following by a UNION or UNION ALL connection.  Then the fun part. The recursive part:

SELECT n+1 FROM temp WHERE n < 10

When Postgres runs this query, it will see the recursive CTE and evaluate the first part, which returns the single value 1.  It then takes this result set as the content of the “temp” table and will then run the second recursive part with this result set.   Since we’re just selecting N+1, the result, in this case, will be “2”.  Since this iteration created new values to add to “temp”, it will incrementally keep going and get “3” then “4” and so forth until the query returns nothing, at which point it will end.  The final result of the recursive CTE is a “temp” table with 10 rows in it.  Values  “1” through “10.”   We can then query “temp”, in our final query, to sum() it and get our grand total of “55.”

We’re going to use the same strategy with anime.  We’re returning our hardcoded first Pokemon anime (id 230) as the starting value, and then  start pulling the relations with this query:

WITH RECURSIVE pokemon_animes(animeid) AS (
  VALUES( 230 )
  UNION
  SELECT s.next FROM seqtb s, pokemon_animes p WHERE p.animeid = s.aid
)
SELECT animeid FROM pokemon_animes ORDER BY animeid;

Our first iteration will return 230, which we will run the recursive query with returning our second set :

921 | Pocket Monsters: Mizu no Miyako no Mamorigami Latias to Latios
990 | Gekijouban Pocket Monsters: Mewtwo no Gyakushuu
991 | Pocket Monsters: Celebi Toki o Koeta Deai
992 | Pocket Monsters: Kesshoutou no Teiou
1020 | Pocket Monsters: Maboroshi no Pokemon Lugia Bakutan
1041 | Pocket Monsters Advanced Generation
3491 | Pocket Monsters: Pikachu no Fuyuyasumi (1999)
3492 | Pocket Monsters: Pikachu no Fuyuyasumi (2000)
3493 | Pocket Monsters: Pikachu no Fuyuyasumi (2001)
7117 | Pocket Monsters Crystal: Raikou Ikazuchi no Densetsu

Our recursive query will then get run again with the above set of ids returning another, third set, and so on until we have no more results.  This gives us a list of all the Pokemon anime that we can use to get all of the character ids.  To pull these character ids will require use of another table from AniDB.

chartb -- The Character Table
id : Primary Key for the character
parentid : The 'guise' for the character.
…

By using our list of anime ids and AniDB’s “charanimereltb” link table, we can get a listing  of all characters in any Pokemon anime with:

WITH RECURSIVE pokemon_animes(animeid) AS (
  VALUES( 230 )
  UNION
  SELECT s.next FROM seqtb s, pokemon_animes p WHERE p.animeid = s.aid
)
SELECT c.charid FROM charanimereltb c, pokemon_animes a WHERE c.aid = a.animeid;

That gets us halfway there!  Now we just need all the characters related to Satoshi to compare the set to.  The parentid/guise from the table above was the first type of relationship tracked in AniDB.  The guise is used to associate alternate forms of a character.  Because this relationship is chronological in story lines, it can usually be represented this way without any problems.  As time passed, a real relations table had been added.  However, since guises were already in the chartb, these were never moved because of the number of changes that would be required to do so.  Technical debt is incurred, but usually we can work around that.  However, eventually this system was further abused by some of the data entry people.    The ‘wild’ version of a Pokemon was created and guises were created for more specific versions of the individual trainer’s instances of these fur balls.  This brings us to where we are now.

charcharreltb -- The Character to Character Relation Table
charid : The subject of the relation. FK to chartb
next : The object of the relation. Also a FK to chartb.
type : How they are related, friend/enemy/etc
...

fushiThis setup causes some fun problems. In this case, to really find out everyone who is related to a character, you not only want all of the relationships about that instance of the character, but any other versions/guises of the character as well (or any other versions/guises of the relations you are related to.)

For example (and this will get geeky), Satoshi and Fushigidane are related to each other as Fushigidane (Japanese for Bulbasaur), which is Satoshi’s personal trained Pokémon. Fushigidane is a type/guise of Bulbasaur. Bulbasaurs can evolve into any Ivysaur. Thus all 3 forms in the system are “related” to Satoshi.

6693 <-related-> 22350 -guise-> 18 -relation-> 20

Or for even crazier example.  A Clefairy/Pippi is a type of Pokemon. These are a guise of a particular Clefairy (id 23859) that was trained/owned by Akai Isamu(id 23857). Akai also has another personal trained Pokemon (id 23858) that is a Pikachu type/guise (id 69). That, of course, is also a guise to the Pikachu (id 8091) that everyone knows and loves belonging to Satoshi (6693). Therefore, all of these characters are related to Satoshi.

169 <-guise_of- 23859 <-related-> 23857 <-related-> 23858 -guise_of-> 69 <-guise_of- 8091 <-related-> 6693

This is a pretty nasty chain. And remember, we need to do it with a single query!  One limitation of recursive queries is that you can only refer to the recursive table once and it may not be (unfortunately) in a subquery.   This provided a challenge, since we needed to return a single value, but I needed to pull from several different potential relations or guises!   I was finally able to work around this with two other great utilities that PostgreSQL offers – Arrays and the unnest() function, which will expand an array into a set of rows.  It ended up looking something like this:

pippi

WITH RECURSIVE related_to_satoshi(charid) AS (
  VALUES (6693)
  UNION
  SELECT unnest(ARRAY[c.next, n.charid, d.parentid, e.parentid, g.id])
  FROM related_to_satoshi s
    -- related to character
    LEFT JOIN charcharreltb c ON (c.charid = s.charid)
      -- get the character to get any guise of them
      LEFT JOIN chartb d ON (c.charid = d.id)
        -- get any reverse relations to the character
        LEFT JOIN charcharreltb n ON (n.next = s.charid)
      -- get that character to get any guise from them
      LEFT JOIN chartb e ON (n.next = e.id)
    -- get my guise so we can look it up next iteration too
    LEFT JOIN chartb g ON (s.charid = g.parentid)
)

Basically, we pull in everyone who was related, or a guise of the character, and then keep iterating over them.  We throw them all into the array, and then unnest it for the final result set.   Because we UNION everything we remove any duplicate values and keep us from going into any possible loops (important!).

Adding in a non-recursive CTE for clarity we can come to our final solution of:

WITH RECURSIVE related_to_satoshi(charid) AS (
  VALUES (6693)
  UNION
  SELECT unnest(ARRAY[c.next, n.charid, d.parentid, e.parentid, g.id])
  FROM related_to_satoshi s
    LEFT JOIN charcharreltb c ON (c.charid = s.charid)
      LEFT JOIN chartb d ON (c.charid = d.id)
    LEFT JOIN charcharreltb n ON (n.next = s.charid)
      LEFT JOIN chartb e ON (n.next = e.id)
    LEFT JOIN chartb g ON (s.charid = g.parentid)
),
pokemon_animes(animeid) AS (
  VALUES( 230 )
  UNION
  SELECT s.next FROM seqtb s, pokemon_animes p WHERE p.animeid = s.aid
),
all_characters AS (
  SELECT DISTINCT c.charid FROM charanimereltb c, pokemon_animes a WHERE c.aid = a.animeid
)
SELECT DISTINCT a.charid as not_related
FROM all_characters a
LEFT JOIN related_to_satoshi r USING(charid)
WHERE r.charid IS NULL
ORDER BY a.charid;

The editors were thrilled and this has started to become a new report for helping clean up the system!   Finally we could catch them all!   PostgreSQL was chosen as the database for AniDB because of its reliability and ability to do these tough reports.  It’s not a decision we’ve ever regretted.

 “Meowth! That’s right!”

Leave a Comment

You must be logged in to post a comment.