Request for Comments: The Fediverse Schema Observatory

I’ve been writing software for decentralized, federated social media (aka the Fediverse) since 2017. I do this work because I believe that breaking the network effects of big social media platforms and giving people more control over their social media experience will vastly improve the social media landscape.

I believe a Fediverse that connects a wide range of communities and software, and prevents a single point of influence from dominating the network, will best serve the public interest. As I said way back in 2018, “Social networks being able to talk to one another leads to a flourishing of different kinds of software for different kinds of people.” We are finally at that moment: the Fediverse has momentum, and major social media platforms are facing enough backlash that the future I envisioned can really happen.

But in order to fully seize this moment, we’ll have to improve interoperability.

Put simply, interoperability means compatibility. True interoperability prevents platform lock-in and allows people to experience social media how they want. It’s the reason I can be on Mastodon and you can be on Pixelfed and we can still be on each other’s social media feeds. Because when all of these services speak the same language, they can talk to one another. But in practice, every service has its own “dialect” and the language doesn’t align 100%. And it is hard to build interoperable Fediverse software when we don’t have the answer to a basic question like, “How can I display a poll and support every possible variant from every possible piece of software?”

So I’ve spent the last couple of months building the Fediverse Schema Observatory: software, designed from the get-go with privacy in mind, that collects baseline anonymized data from Fediverse servers to enhance interoperability.

Having been on the Fediverse since 2016, I know that privacy and community safety are extremely important to Fediverse users. Jon Pincus describes Fediverse consent culture in Eight tips about consent for Fediverse developers: “There’s a long history of developers writing or proposing fediverse search engines, scrapers, bridges and other services that use people’s public posts without opt-in consent … and suddenly being in the middle of a firestorm of criticism and feedback.” I take community safety very seriously and I take Pincus’ advice seriously as well. I’m writing this blog post to explain what I’m doing, why I’m doing it, and to offer a public comment process before launching the Observatory. I want to be able to say with confidence that the Observatory is a public service. The only way to do that is to take feedback from the public.

Specifically, your feedback. The project as described in this post is open to change–or even cancellation–based on the feedback I get from you. If this project makes the Fediverse less safe (and I hope it does not) then there’s no point in continuing it.

What is the Fediverse Schema Observatory?

The Fediverse Schema Observatory collects baseline anonymized data to enhance interoperability across servers. The anonymized data will be released to the public domain because I believe that this data, like other general statistics collection projects such as FediDB and Fediverse Observer, is in the public interest and causes no harm.

It’s an imperfect metaphor, but one way to think about this is: I am not reading everyone’s mail. I am measuring the dimensions and weight of some mail and using that information to help build better mail trucks.

The data that’s collected is intended to be used by software developers who are trying to maximize the compatibility of software across the Fediverse network. Developers of software like Mastodon or NodeBB or Lemmy might use it to make sure that they are correctly accounting for the kind of data their users might encounter in the wild. Groups that work on developing standards like the W3C Social Web Incubator Community Group could use this data to quickly settle arguments over what kind of software uses what kind of ActivityPub features.

This project is opt-in at the server level, at least to the extent that I can make it so given the technical limitations of federation. (Ironically, because I’m throwing away all user data, a list of users who opt in would not be helpful.) On top of the opt-in design, there is also an opt-out mechanism for servers that choose not to participate. More on that, and on the two main components of the project, below.

The Fediverse server

The Fediverse server receives ActivityPub data, scrubs it of user-generated content, and records a schema to a database that is released in the public domain. Here’s a step-by-step example of what happens when a user makes a post to the Fediverse and that piece of data is eventually delivered to and processed on the server:

The admin of example.social has elected to forward all of their users’ public messages to a relay called example.relay. The user @alice@example.social makes a public post that says “Hello world!” and the message is forwarded by example.social to example.relay.
Next, example.relay forwards Alice’s message to every server that is subscribed to the relay.
One of the subscribers is the Fediverse Schema Observatory, so it gets Alice’s message in its shared inbox.
Immediately on receipt, the Observatory removes nearly all user-generated data from Alice’s post, including Alice’s user info, the timestamp of the post, anyone else it was addressed to, the server that it came from, and of course the content of the post. It then puts what remains in a database (more detail on this below).
If this is the first time the Observatory has seen a message from example.social, then it makes a note like, “example.social is using Mastodon version 4.2.9”, but it does not associate this data with the particular message that came in. (It will refresh this information once every 24 hours in case the server upgrades its software to a new version.)
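
The bookkeeping in that last step can be sketched roughly as follows. This is a minimal Python illustration, not the Observatory's actual code: the SoftwareRegistry class, the lookup callback (standing in for whatever mechanism is used to ask a server what software it runs), and the choice to refresh lazily the next time a domain is seen are all assumptions made for the example.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

REFRESH_SECONDS = 24 * 60 * 60  # "once every 24 hours"


@dataclass
class SoftwareInfo:
    name: str
    version: str
    checked_at: float  # when we last asked the server, not when any post arrived


class SoftwareRegistry:
    """Hypothetical store of "domain X uses software Y on version Z"."""

    def __init__(self, lookup: Callable[[str], Tuple[str, str]],
                 clock: Callable[[], float] = time.time) -> None:
        self._lookup = lookup  # assumed helper: domain -> (name, version)
        self._clock = clock
        self._known: Dict[str, SoftwareInfo] = {}

    def note_domain(self, domain: str) -> SoftwareInfo:
        """Record the software for a domain, re-checking if the note is stale."""
        now = self._clock()
        info = self._known.get(domain)
        if info is None or now - info.checked_at >= REFRESH_SECONDS:
            name, version = self._lookup(domain)
            info = SoftwareInfo(name, version, now)
            self._known[domain] = info
        return info
```

Note that nothing here links the note to the message that triggered it; the registry only ever learns the domain.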

You’ll note that I said “nearly”. The Observatory deletes almost all user-generated data. In technical terms: I am not recording user data; I am recording a data schema and inferring its types.

But let’s get more concrete about it. Let’s say this post I made is sent to the Observatory’s inbox:

That is what the post might look like to a human user, but when it arrives in the server’s shared inbox it looks like this (before we scrub the data):

JSON

{
  "@context": [
    "https://www.w3.org/ns/activitystreams",
    {
      "ostatus": "http://ostatus.org#",
      "atomUri": "ostatus:atomUri",
      "inReplyToAtomUri": "ostatus:inReplyToAtomUri",
      "conversation": "ostatus:conversation",
      "sensitive": "as:sensitive",
      "toot": "http://joinmastodon.org/ns#",
      "votersCount": "toot:votersCount"
    }
  ],
  "id": "https://friend.camp/users/darius/statuses/113052195027718205/activity",
  "type": "Create",
  "actor": "https://friend.camp/users/darius",
  "published": "2024-08-30T17:39:56Z",
  "to": [
    "https://www.w3.org/ns/activitystreams#Public"
  ],
  "cc": [
    "https://friend.camp/users/darius/followers",
    "https://mastodon.social/users/zephoria"
  ],
  "object": {
    "id": "https://friend.camp/users/darius/statuses/113052195027718205",
    "type": "Note",
    "summary": null,
    "inReplyTo": null,
    "published": "2024-08-30T17:39:56Z",
    "url": "https://friend.camp/@darius/113052195027718205",
    "attributedTo": "https://friend.camp/users/darius",
    "to": [
      "https://www.w3.org/ns/activitystreams#Public"
    ],
    "cc": [
      "https://friend.camp/users/darius/followers",
      "https://mastodon.social/users/zephoria"
    ],
    "sensitive": false,
    "atomUri": "https://friend.camp/users/darius/statuses/113052195027718205",
    "inReplyToAtomUri": null,
    "conversation": "tag:friend.camp,2024-08-30:objectId=26481883:objectType=Conversation",
    "localOnly": false,
    "content": "<p>On Sep 12 the Applied Social Media Lab (where I work) is hosting “Beyond Discourse Dumpster Fires: Strategies and Tools for Better Online Civil Space\" to explore new ideas for healthier and more satisfying online communication.</p><p>I'm particularly excited to hear from <span class=\"h-card\" translate=\"no\"><a href=\"https://mastodon.social/@zephoria\" class=\"u-url mention\">@<span>zephoria</span></a></span>!</p><p>RSVP here, online via Zoom or (limited) in-person tickets in Cambridge MA: <a href=\"https://brk.mn/discourse\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" translate=\"no\"><span class=\"invisible\">https://</span><span class=\"\">brk.mn/discourse</span><span class=\"invisible\"></span></a>  </p><p>There will also be a recording available afterwards, so those who do not use Zoom can still watch!</p>",
    "attachment": [],
    "tag": [],
    "replies": {
      "id": "https://friend.camp/users/darius/statuses/113052195027718205/replies",
      "type": "Collection",
      "first": {
        "type": "CollectionPage",
        "next": "https://friend.camp/users/darius/statuses/113052195027718205/replies?only_other_accounts=true&page=true",
        "partOf": "https://friend.camp/users/darius/statuses/113052195027718205/replies",
        "items": []
      }
    }
  },
  "signature": {
    "type": "RsaSignature2017",
    "creator": "https://friend.camp/users/darius#main-key",
    "created": "2024-08-30T15:39:56Z",
    "signatureValue": "YUBvMYBKNWdXcoWb9uLtu38Rp... etc etc etc"
  }
}

The Observatory takes in the above data, which includes information like the username, URL of the original post, the timestamp it was posted, who was @-mentioned in the post, and the content of the post.

That data is immediately run through a scrubbing algorithm which turns it into this:

JSON

{
  "@context": [
    "https://www.w3.org/ns/activitystreams",
    {
      "ostatus": "http://ostatus.org#",
      "atomUri": "ostatus:atomUri",
      "inReplyToAtomUri": "ostatus:inReplyToAtomUri",
      "conversation": "ostatus:conversation",
      "sensitive": "as:sensitive",
      "toot": "http://joinmastodon.org/ns#",
      "votersCount": "toot:votersCount"
    }
  ],
  "id": "<uri>",
  "type": "Create",
  "actor": "<uri>",
  "published": "<date-time>",
  "to": [
    "https://www.w3.org/ns/activitystreams#Public"
  ],
  "cc": [
    "<uri>"
  ],
  "object": {
    "id": "<uri>",
    "type": "Note",
    "summary": "<null>",
    "inReplyTo": "<null>",
    "published": "<date-time>",
    "url": "<uri>",
    "attributedTo": "<uri>",
    "to": [
      "https://www.w3.org/ns/activitystreams#Public"
    ],
    "cc": [
      "<uri>"
    ],
    "sensitive": "<boolean>",
    "atomUri": "<uri>",
    "inReplyToAtomUri": "<null>",
    "conversation": "<string>",
    "localOnly": "<boolean>",
    "content": "<string>",
    "attachment": [
      "<undefined>"
    ],
    "tag": [
      "<undefined>"
    ],
    "replies": {
      "id": "<uri>",
      "type": "Collection",
      "first": {
        "type": "CollectionPage",
        "next": "<uri>",
        "partOf": "<uri>",
        "items": [
          "<undefined>"
        ]
      }
    }
  },
  "signature": {
    "type": "RsaSignature2017",
    "creator": "<uri>",
    "created": "<date-time>",
    "signatureValue": "<string>"
  }
}

What you’re seeing is not some kind of shorthand. I am literally recording the text “<uri>” instead of a URL of a post. I am literally recording the text “<date-time>” instead of the time a post was published. Importantly, I’m also recording which fields were sent by Hometown (the server) for the post. This is essential information for anyone who wants to know how to write software that is compatible with Hometown. I would like to show developers that sometimes a tag contains an array of Hashtag, but sometimes it contains Mention and Emoji and whatever else people decide to put in there.

You’ll also notice that certain fields do retain their original data. Anything in the @context field stays because that’s all web standards boilerplate, which is important for programmers to know about, but doesn’t reveal anything about users. I also record when a to or cc address contains something starting with https://www.w3.org/ns/ (or a similar list of standard namespace prefixes) because it’s just one of a set of standards-based strings that indicates the privacy level of a post.

And any field called type will be recorded. This is crucial because it tells me whether something is, for example, a Note (a short text post), a Person (when someone updated their profile) or a Tombstone (which is a kind of deletion record). It won’t record what was sent in the Note, just that a Note was sent. It simply records the shape of the data.
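
To make the rules above concrete, here is a minimal Python sketch of this kind of scrubbing. It is an illustration only, not the Observatory's actual algorithm. The function names, the "<number>" placeholder, the single-entry prefix list, and the collapsing of duplicate array entries (inferred from the two cc addresses becoming one "<uri>" in the example) are all assumptions.

```python
import re
from typing import Any, Optional

# Assumption: the real list of standard namespace prefixes is longer.
STANDARD_PREFIXES = ("https://www.w3.org/ns/",)
DATE_TIME = re.compile(
    r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})$"
)


def scrub_scalar(value: Any, keep_standard: bool = False) -> str:
    """Replace a single value with a placeholder naming its inferred type."""
    if value is None:
        return "<null>"
    if isinstance(value, bool):  # must come before the number check
        return "<boolean>"
    if isinstance(value, (int, float)):
        return "<number>"  # assumed placeholder, not shown in the example
    if isinstance(value, str):
        if keep_standard and value.startswith(STANDARD_PREFIXES):
            return value  # e.g. the #Public addressing constant
        if DATE_TIME.match(value):
            return "<date-time>"
        if value.startswith(("http://", "https://")):
            return "<uri>"
        return "<string>"
    raise TypeError(f"unexpected JSON value: {value!r}")


def scrub(value: Any, key: Optional[str] = None) -> Any:
    """Recursively reduce an ActivityPub document to the shape of its data."""
    if key == "@context":
        return value  # standards boilerplate, kept verbatim
    if key == "type" and isinstance(value, str):
        return value  # Note, Person, Tombstone, ...
    if isinstance(value, dict):
        return {k: scrub(v, k) for k, v in value.items()}
    if isinstance(value, list):
        if not value:
            return ["<undefined>"]  # empty array: element type unknown
        scrubbed = []
        for item in value:
            shape = scrub(item, key)
            if shape not in scrubbed:  # keep one entry per distinct shape
                scrubbed.append(shape)
        return scrubbed
    return scrub_scalar(value, keep_standard=key in ("to", "cc"))
```

Calling scrub() on a parsed activity returns a new structure with the same keys, in which only @context, type values, and standard addressing constants keep their original text.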

The public domain database

Every so often a copy of the database will be released with a CC0 license, which places it in the public domain. This is so people can build other applications on top of the database if they so choose. The database will be released at intervals to give me time to audit the data to make sure nothing identifying has somehow leaked through before release.
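
Part of such an audit could be automated. The sketch below is a hypothetical Python check, not a description of the actual audit process: it walks an already-scrubbed record and reports any value that is not a placeholder, a type name, @context boilerplate, or a standard addressing constant. The find_leaks name, the placeholder pattern, and the prefix list are assumptions.

```python
import re
from typing import Any, Iterator, Optional, Tuple

PLACEHOLDER = re.compile(r"^<[a-z-]+>$")  # "<uri>", "<date-time>", ...
STANDARD_PREFIXES = ("https://www.w3.org/ns/",)  # assumed, as in the text


def find_leaks(value: Any, path: str = "$",
               key: Optional[str] = None) -> Iterator[Tuple[str, Any]]:
    """Yield (path, value) for every value that looks like unscrubbed data."""
    if key == "@context":
        return  # boilerplate is allowed to keep its original text
    if isinstance(value, dict):
        for k, v in value.items():
            yield from find_leaks(v, f"{path}.{k}", k)
    elif isinstance(value, list):
        for i, item in enumerate(value):
            yield from find_leaks(item, f"{path}[{i}]", key)
    elif isinstance(value, str):
        if key == "type" or PLACEHOLDER.match(value):
            return
        if key in ("to", "cc") and value.startswith(STANDARD_PREFIXES):
            return
        yield path, value
    else:
        # Raw nulls, booleans, and numbers should already be placeholders.
        yield path, value
```

A check like this cannot judge whether an unfamiliar type value is itself identifying, so it would complement a human review rather than replace it.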

The Fediverse Schema Observatory server is not a scraper

I repeat: the Fediverse Schema Observatory server is not a scraper.

It does not consider all public posts on the network “up for grabs.” It receives posts from public Fediverse relays, which are mechanisms that server operators must choose to opt in to. These messages are sent straight to the Observatory’s inbox.

Essentially, a Fediverse relay is a public commons for people to share and rebroadcast data. If 10 servers participate in a relay, the admin of each individual server has agreed to “donate” all their server’s public data to the relay. What they get in return is data from the other 9 servers, which helps populate their federated timelines.

I am making two important assumptions here. First, participation in a public relay indicates consent from a server admin to have their users’ public messages rebroadcast far and wide. Second, grabbing these hyperpublic messages and scrubbing all their identifying data for public open standards research is basically fine.

Unfortunately, ActivityPub is a “leaky bucket” when it comes to federated data. Because of the way federation works in 2024, sometimes public posts wind up on servers where people don’t expect them to be, all because someone boosted someone who boosted someone who boosted someone, and so on. For example, the Observatory has received data that originated from my real-life server friend.camp even though Friend Camp does not subscribe to any of the 3 relays I currently ingest data from. I refer to this phenomenon as “Announce-leaking” (“announce” is the technical name for a boost), and it’s the reason I can’t guarantee that the Observatory will only process data from servers that opt in to relays. Public posts from basically anywhere can wind up sent to a relay.

While I don’t have any huge concerns around privacy, I am recording domain names. This is so a researcher can say “Oh, interesting! friend.camp is sending a new message type I’ve never seen before. I wonder what their deal is,” and go look at the server to ensure it’s not claiming to run Hometown when really it’s running some weird custom software.

The only association I am making with domain names is that “domain X uses software Y on version Z”. That’s it! I don’t even record that a given piece of data was seen from a specific domain so there’s no way to say “friend.camp emitted 10 polls last week.” All you can say is, “We saw 10 polls last week that came from Hometown version 1.1.1, and the list of Hometown 1.1.1 servers we know about contains friend.camp and example.social.”
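
One way to picture this separation is as two stores that never reference each other. The Python sketch below is a simplified illustration under assumed names (ObservatoryStore, record, count, domains_running); the real database layout may differ. It uses "Question", the ActivityStreams type that Mastodon-style polls are sent as.

```python
from collections import Counter
from typing import Dict, Set, Tuple

Software = Tuple[str, str]  # (name, version), e.g. ("Hometown", "1.1.1")


class ObservatoryStore:
    """Hypothetical layout: domains and observations are kept apart."""

    def __init__(self) -> None:
        # "domain X uses software Y on version Z", and nothing else
        self.domain_software: Dict[str, Software] = {}
        # How often each object type was seen, per software version
        self.type_counts: Counter = Counter()

    def record(self, domain: str, software: Software, object_type: str) -> None:
        self.domain_software[domain] = software
        # The domain is deliberately left out of the observation key.
        self.type_counts[(software, object_type)] += 1

    def count(self, software: Software, object_type: str) -> int:
        return self.type_counts[(software, object_type)]

    def domains_running(self, software: Software) -> Set[str]:
        return {d for d, s in self.domain_software.items() if s == software}
```

With this layout you can ask how many polls came from Hometown 1.1.1 and which domains run Hometown 1.1.1, but there is no stored link that would answer how many polls came from friend.camp.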

This means the Observatory is exposing no more data than is already exposed on a site like Fediverse Observer. But unlike that site, I am providing an opt-out for admins who don’t want their domains in the dataset at all.

How to opt out

Even though the Observatory is opt-in from the ground up, it also provides an opt-out mechanism that server admins can use on behalf of their servers. Unfortunately, opting out at the user level would require de-anonymizing users in order to identify who has elected to opt out. Also, if I am promising to instantly throw away identifying information, I can’t identify who sent the message to recognize who opted out! And I can’t use something like #nobot or indexable because those don’t exist at the post level; they exist at the user level and would require me to query people’s profiles for every message I get. So server admins can email fediverseobservatory@cyber.harvard.edu and have their servers added to a list of servers that have opted out.
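
A server-level opt-out check of this kind could look something like the Python sketch below. It is an assumption-laden illustration: the OPTED_OUT set, the function names, reading the origin domain from the actor field, and dropping messages whose origin cannot be determined are all choices made for the example, and the check is assumed to run before any scrubbing or storage.

```python
from typing import Any, Dict, Optional, Set
from urllib.parse import urlsplit

# Hypothetical opt-out list, maintained by hand from admin emails.
OPTED_OUT: Set[str] = {"optout.example"}


def origin_domain(activity: Dict[str, Any]) -> Optional[str]:
    """Return the host of the activity's actor, or None if it can't be read."""
    actor = activity.get("actor")
    if isinstance(actor, dict):  # actor may be embedded as an object
        actor = actor.get("id")
    if not isinstance(actor, str):
        return None
    return urlsplit(actor).hostname  # already lower-cased by urlsplit


def should_process(activity: Dict[str, Any],
                   opted_out: Set[str] = OPTED_OUT) -> bool:
    """True only if the origin is known and has not opted out."""
    domain = origin_domain(activity)
    return domain is not None and domain not in opted_out
```

Because the check needs only the domain, it never has to look at, or remember, which user sent the message.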

The Observatory website

Separate from the server and its database, I am building a web application that reads the server’s database and gives researchers answers to critical interoperability questions like:

What Fediverse software sends out posts that have a commentsEnabled property? When did we first start seeing this? Who do we contact to ask what that’s supposed to mean?
How many different kinds of objects are being sent in the tag field?
What are all the different kinds of data that different software implementations cram into the attachment field?
If I want to support showing polls that come from software beyond just Mastodon, what do I need to look for? How is a Pleroma or Misskey poll different from a Mastodon poll? Is there any other software out there I forgot about that is sending out polls?
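
Questions like these map onto fairly simple queries over the scrubbed records. The Python sketch below shows two of them against a tiny hand-written sample. The record layout, the function names, and the software names "SoftwareA" and "SoftwareB" are hypothetical, and the sample is invented for illustration rather than drawn from real data.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List

Record = Dict[str, Any]  # assumed shape: software, version, scrubbed schema

RECORDS: List[Record] = [
    {"software": "SoftwareA", "version": "1.0", "schema": {
        "type": "Create",
        "object": {"type": "Note", "commentsEnabled": "<boolean>",
                   "tag": [{"type": "Hashtag", "name": "<string>", "href": "<uri>"}]}}},
    {"software": "SoftwareB", "version": "2.3", "schema": {
        "type": "Create",
        "object": {"type": "Note",
                   "tag": [{"type": "Mention", "name": "<string>", "href": "<uri>"},
                           {"type": "Emoji", "id": "<uri>"}]}}},
    {"software": "SoftwareB", "version": "2.3", "schema": {
        "type": "Announce", "object": "<uri>"}},
]


def _object_schema(record: Record) -> Dict[str, Any]:
    """The embedded object's schema, or {} when only a URI was sent."""
    obj = record["schema"].get("object")
    return obj if isinstance(obj, dict) else {}


def software_sending(records: Iterable[Record], prop: str) -> List[str]:
    """Which software sends objects that carry the given property?"""
    return sorted({r["software"] for r in records if prop in _object_schema(r)})


def tag_type_counts(records: Iterable[Record]) -> Counter:
    """How many objects of each type appear in the tag field?"""
    counts: Counter = Counter()
    for record in records:
        tags = _object_schema(record).get("tag", [])
        if isinstance(tags, dict):  # a single tag sent without an array
            tags = [tags]
        elif not isinstance(tags, list):
            tags = []
        for tag in tags:
            if isinstance(tag, dict):
                kinds = tag.get("type", "<missing>")
                for kind in kinds if isinstance(kinds, list) else [kinds]:
                    counts[kind] += 1
    return counts
```

For example, software_sending(RECORDS, "commentsEnabled") lists the software to contact about that property, and tag_type_counts(RECORDS) summarizes what ends up in tag.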

Ideally, this creates a world where someone writing Fediverse software can use data-based decision-making, rather than guesswork, to figure out how to support the widest range of software available.

This application will be open source under a to-be-determined license, and a hosted version will be available to the public to explore the data (which only a preapproved set of researchers will have the ability to annotate).

Since showing is easier than telling, here’s a brief demo video of the application in its current form:

Please give me feedback

I believe that tools like the Observatory are critical for the next big phase of the Fediverse: one where people are building software that truly takes advantage of everything federation has to offer, not merely cloning existing services with a sprinkle of decentralization on top. I believe that the work to get us to that place is in the public interest, but I also believe that anything that claims to be in the public interest must have genuine buy-in from the public.

That’s why I want to hear from you. Whether you’re a dedicated Fediverse admin or user, or simply someone who believes we can build a better Internet, your thoughts on this project matter.

So please send your feedback via email to fediverseobservatory@cyber.harvard.edu. I’m especially interested in hearing about safety and privacy concerns you might have, but I’m happy to hear just about any thoughts, positive or negative. The public comment period will last until December 1, 2024, but I’ll extend the period to the end of the year if it looks like we need more time.

Thanks for reading, and for helping me out with this project.