TidyTuesday: Sister Cities

data science
rstats
tidytuesday
Published

May 12, 2026

Introduction

This week’s TidyTuesday explores the connections between sister cities across the globe. Unlike previous posts, this one is going to use shiny to create an interactive app.

In particular, I’m interested in identifying chains of sister cities that connect two countries.

Setup

library(tidyverse)
library(shiny)

Loading the Data

# tuesdata <- tidytuesdayR::tt_load('2026-05-12')
# cities   <- tuesdata$cities |>
#   write_csv("cities.csv")
# links    <- tuesdata$links |>
#   write_csv("links.csv")

cities <- read_csv("cities.csv")
links  <- read_csv("links.csv")

Distance chain finding algorithm

To find the connection between two cities, we can treat the sister city relationships as an undirected graph. The following function uses the igraph package to find the the number of links in the shortest path between any two city codes. I’m also creating a tall version of this for different kinds of visualizations.

library(igraph, warn.conflicts = FALSE)

links_graph <- graph_from_data_frame(links, directed = FALSE)
distances <- distances(links_graph)

distances <- as_tibble(distances)

distances_tall <- distances |>
  pivot_longer(everything(), names_to = "to", values_to = "distance") |>
  mutate(from = rep(names(distances), ncol(distances)))
number_of_connections <- distances_tall |>
  group_by(from) |>
  summarise(total_connections = sum(distance > 0 & distance != Inf))

ggplot(number_of_connections |>
  mutate(total_connections = as.factor(total_connections)), aes(x = total_connections)) +
  geom_histogram(stat = "count") 
Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
`binwidth` and `bins`

It looks like there is one giant web of 4552 sister cities and a cluster of isolated smaller webs.

ggplot(distances_tall, aes(x = distance)) +
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
Warning: Removed 9188278 rows containing non-finite outside the scale range
(`stat_bin()`).

Let’s connect the city ids to the remaining data and see if we can identify some additional patterns

number_of_connections_detail <- number_of_connections |>
  left_join(cities, by = c("from" = "id")) |>
  mutate(is_isolated = total_connections != 4552)

connections_formula_location <- is_isolated ~  lng + lat 
number_of_connections_detail |>
  glm(connections_formula_location, family = "binomial", data = _ ) |>
  summary()

Call:
glm(formula = connections_formula_location, family = "binomial", 
    data = number_of_connections_detail)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.9923457  0.1433680 -20.872   <2e-16 ***
lng         -0.0009952  0.0006021  -1.653   0.0984 .  
lat          0.0335798  0.0031144  10.782   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4946.3  on 5469  degrees of freedom
Residual deviance: 4773.5  on 5467  degrees of freedom
AIC: 4779.5

Number of Fisher Scoring iterations: 5

It looks like the further South you go, the more likely you are to be part of the giant network of cities. Taking this a step further, let’s plot all the latitudes and longitudes and categorize them by the number of connections.

ggplot(data = number_of_connections_detail |> mutate(total_connections = as.factor(total_connections)), aes(x = lng, y = lat, color = total_connections)) +
  geom_point() +
  theme(legend.position = "bottom") +
  scale_color_brewer(palette = "Set1")

Instead of just highlighting the isolated cities, let’s actually connect them.

links_detail <- links |>
  left_join(cities |> select(id, lng, lat), by = c("source" = "id")) |>
  rename(from_lng = lng, from_lat = lat) |>
  left_join(cities |> select(id, lng, lat), by = c("target" = "id")) |>
  rename(to_lng = lng, to_lat = lat) |>
  left_join(number_of_connections_detail |> select(from, is_isolated, total_connections),  by= c("source" = "from"))

ggplot(data = links_detail) +
  geom_point(aes(x = from_lng, y = from_lat), color = "grey", alpha = 0.2) +
  geom_segment(data = links_detail |> filter(total_connections <=3), mapping = aes(x = from_lng, y = from_lat, xend = to_lng, yend = to_lat, color = as.factor(total_connections))) +
  theme(legend.position = "bottom")

It looks like a mess at the top, but clearly being in the south means you are way less likely to have a small network of sister cities.