---
date: 2024-01-07
title: Semantics of SDBD
series:
  name: "SDBD: Creating a Data Format"
  number: 2
---
To really start bringing this new data format to life, we need to talk about what's in it. Establishing the semantics of a format gives us the terms and concepts we need to talk about the format abstractly before we get to any concrete details.
## Start with the data
Let's start with a simple example. Here is a base64 encoding of some arbitrary data.
```text
oSAEACBQbusr6ZZslqwlPqhpIogL6slC7t74JE2zvQjVgNvwuQIxo76Rt2W9KIJ/khyI8jgd61ZU
uZPTWWJqf2Uw9N1cYoABdEISbjSoHOWd9JE8NIMQABpwLrv1qgE=
```

How does somebody make sense of this? On my system I know how I created this file and I know the filename, so I know exactly what it is. But without that, what do you do? Let's use the `file` tool to make a guess by running `file -i` on it.

```text
application/octet-stream; charset=binary
```

That's not useful. The output might as well have been

```text
¯\_(ツ)_/¯
```

## Add the metadata
The only way you're going to make sense of this data is if I start giving you more information. The first thing I'm going to tell you is that this is compressed with brotli.

The format will use the concept that data encoding is different from the file type. Generally that will mean we can compress the data without having to nest our metadata. I'll add a metadata field to the format that tells you the encoding.

```text
content-encoding: br
```

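As a rough sketch of what honoring `content-encoding` could look like, here is a small Python dispatcher. The `DECODERS` table and the `decode_content` function are names made up for illustration, not part of SDBD. `gzip` and `deflate` use the standard library; `br` is only registered if the third-party `brotli` package happens to be installed. With that package present, passing the base64-decoded example bytes and `"br"` to `decode_content` would undo the compression.

```python
import gzip
import zlib
from typing import Callable, Dict, Optional

# Hypothetical registry: content-encoding value -> function that undoes it.
DECODERS: Dict[str, Callable[[bytes], bytes]] = {
    "gzip": gzip.decompress,
    "deflate": zlib.decompress,  # HTTP "deflate" is a zlib-wrapped DEFLATE stream
}

try:
    import brotli  # assumption: the third-party `brotli` package, not stdlib
    DECODERS["br"] = brotli.decompress
except ImportError:
    pass  # without the package, "br" is reported as unsupported below


def decode_content(data: bytes, content_encoding: Optional[str]) -> bytes:
    """Undo the content-encoding, or return the data unchanged if there is none."""
    if content_encoding is None:
        return data
    name = content_encoding.strip().lower()
    if name not in DECODERS:
        raise ValueError(f"unsupported content-encoding: {name!r}")
    return DECODERS[name](data)
```
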
If you decompress the data and look at the result, it will immediately become apparent what the file is. For the sake of continuing this article, let's pretend that we're not human beings with advanced pattern recognition capabilities, but a computer that's not being allowed to guess.

In order to understand this file, the next thing you need is to know the file type. Running `file -i` on the decompressed file gets it right this time, but we're no longer allowed to guess. I'll give you another metadata field that tells you the file type.

```text
content-type: text/plain; charset=us-ascii
```

It's just plain ASCII text! Now that you have this information, you can properly interpret the data. Go ahead and try decoding the data now.
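
Here's a minimal sketch of that last step, assuming the data has already been decompressed. `parse_content_type` and `decode_text` are made-up helper names; the parsing leans on Python's standard `email.message.Message`, which understands MIME-style parameters. Falling back to UTF-8 when no charset is given is an assumption on my part, not something SDBD defines.

```python
from email.message import Message
from typing import Optional, Tuple


def parse_content_type(value: str) -> Tuple[str, Optional[str]]:
    """Split a content-type value into (MIME type, charset or None)."""
    msg = Message()
    msg["content-type"] = value
    return msg.get_content_type(), msg.get_content_charset()


def decode_text(data: bytes, content_type: str) -> str:
    """Interpret already-decompressed bytes as text, guided by content-type."""
    mime_type, charset = parse_content_type(content_type)
    if not mime_type.startswith("text/"):
        raise ValueError(f"{mime_type} is not a text type")
    # Assumption: default to UTF-8 when the header names no charset.
    return data.decode(charset or "utf-8")
```

Calling `parse_content_type("text/plain; charset=us-ascii")` returns `("text/plain", "us-ascii")`, which is everything `decode_text` needs to turn the bytes into a string without guessing.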
## This looks familiar...
Go ahead and give yourself a gold star if you recognize what I've been doing so far. That's right, I am straight up ripping off HTTP. The metadata of SDBD will be a list of headers semantically equivalent to HTTP headers. I'll even say, for the sake of implementation, we'll follow similar rules:
1. Header names are case-insensitive
2. Multiple headers can have the same name
3. The order of headers is significant and must be preserved

Note that when we get to the proof of concept, I won't be following any of these rules.
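
One way to picture those three rules is a plain ordered list of name/value pairs with case-insensitive lookup. This is a hypothetical `Headers` class for illustration only; it isn't the proof-of-concept code.

```python
from typing import List, Optional, Tuple


class Headers:
    """Ordered list of (name, value) pairs with case-insensitive lookup."""

    def __init__(self) -> None:
        self._items: List[Tuple[str, str]] = []

    def add(self, name: str, value: str) -> None:
        # Append instead of overwriting, so repeated names and order survive.
        self._items.append((name, value))

    def get_all(self, name: str) -> List[str]:
        """Return every value stored under this name, in insertion order."""
        wanted = name.lower()
        return [value for key, value in self._items if key.lower() == wanted]

    def get(self, name: str, default: Optional[str] = None) -> Optional[str]:
        """Return the first value stored under this name, or the default."""
        values = self.get_all(name)
        return values[0] if values else default

    def items(self) -> List[Tuple[str, str]]:
        """Return a copy of all pairs in their original order."""
        return list(self._items)
```

A dictionary keyed by header name would be simpler, but it would break rule 2 and, depending on how it's built, rule 3.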
## What's your name?
I do need to make a few tweaks. Solving part of our original problem requires storing the filename. There aren't any standard HTTP headers that are exactly meant for this. `content-disposition` can contain a filename, but its real purpose is something else. That header will normally look something like this:
```text
content-disposition: attachment; filename="filename.jpg"
```

This header is meant to tell a browser whether the response should be displayed in the browser or downloaded. SDBD isn't meant specifically for browsers so we would be including useless information just to store a filename in an awkwardly formatted field.

I'll create a new header for this.
```text
content-name: sample.txt
```
## When does it end?
We need a way to identify exactly what size the data is. `content-length` is technically optional in HTTP, since you can mark the end of a response by closing the connection. But we're not making any assumptions about the context of an SDBD, so we can't assume a connection to close. I'd rather not try to create a marker for the end of the data, so I'll say that `content-length` is required.
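
To see why a required `content-length` removes the need for an end marker, here's a minimal sketch of a reader. `read_body` is a hypothetical helper, and it assumes the metadata has already been parsed and the stream is positioned at the first byte of the data; the actual binary layout isn't defined yet.

```python
import io
from typing import BinaryIO


def read_body(stream: BinaryIO, content_length: int) -> bytes:
    """Read exactly content_length bytes, failing if the stream ends early."""
    if content_length < 0:
        raise ValueError("content-length must not be negative")
    chunks = []
    remaining = content_length
    while remaining > 0:
        chunk = stream.read(remaining)  # may legally return fewer bytes
        if not chunk:
            raise EOFError(
                f"expected {content_length} bytes, stream ended {remaining} short"
            )
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


# Usage: the reader stops after 5 bytes and leaves the rest untouched.
stream = io.BytesIO(b"abcdefgh")
body = read_body(stream, 5)
```

The reader never has to inspect the data itself to find the end, so the data can contain any byte sequence at all.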
## What do we actually need?
Let's wrap this up by deciding what's required, what *should* be present, and what's optional. I've already decided that `content-length` is required, and I think that's the only thing we absolutely need.

I think `content-type` should be present, but I don't want to make it an absolute requirement. I don't see a use case where you wouldn't want `content-type`, but let's not limit ourselves unnecessarily. Also, since I haven't explicitly stated it yet, the value of `content-type` must be a MIME type.

If the document is encoded (compressed), `content-encoding` must also be present unless the metadata is specifically describing the encoded data. I don't want to define exactly what the value of `content-encoding` must be, other than that implementations should support the common values used on the web: `gzip`, `deflate`, and `br`.

Other encodings should perhaps use their own MIME type, such as `application/x-bzip2`. I'm not ready to set that in stone in case somebody comes up with a use case for encoding that isn't about compression.

The data should have a good identifier as well, so we should give it a filename with `content-name`, a URI with `content-location`, or both.

Anything else is optional.

With that, here is the complete metadata for the example document.
```text
content-name: sample.txt
content-encoding: br
content-type: text/plain; charset=us-ascii
content-length: 95
```

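As a sanity check, here's a small sketch that applies the rules above to the example. `check_metadata` is a hypothetical helper, and treating the "should" items as warnings rather than errors is my reading of the rules. It doesn't try to verify the `content-encoding` rule, since that can't be checked from the headers alone.

```python
import base64

SAMPLE_B64 = (
    "oSAEACBQbusr6ZZslqwlPqhpIogL6slC7t74JE2zvQjVgNvwuQIxo76Rt2W9KIJ/khyI8jgd61ZU"
    "uZPTWWJqf2Uw9N1cYoABdEISbjSoHOWd9JE8NIMQABpwLrv1qgE="
)

SAMPLE_HEADERS = [
    ("content-name", "sample.txt"),
    ("content-encoding", "br"),
    ("content-type", "text/plain; charset=us-ascii"),
    ("content-length", "95"),
]


def check_metadata(headers, data):
    """Return (errors, warnings) for a list of (name, value) header pairs."""
    first = {}
    for name, value in headers:
        # Assumption: when a name repeats, only the first value is checked.
        first.setdefault(name.lower(), value)

    errors, warnings = [], []
    length = first.get("content-length")
    if length is None:
        errors.append("content-length is required")
    elif not (length.isascii() and length.isdigit()):
        errors.append("content-length must be a non-negative integer")
    elif int(length) != len(data):
        errors.append(f"content-length is {length} but the data is {len(data)} bytes")

    if "content-type" not in first:
        warnings.append("content-type should be present")
    if "content-name" not in first and "content-location" not in first:
        warnings.append("content-name or content-location should be present")
    return errors, warnings


data = base64.b64decode(SAMPLE_B64)
errors, warnings = check_metadata(SAMPLE_HEADERS, data)
```

For the example, both lists come back empty: the base64 text decodes to 95 bytes, so `content-length` here counts the encoded (compressed) data.
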
## Sounds simple enough
That's all the information a computer would need to interpret the example data. The format consists of the data combined with metadata that is a list of headers semantically similar to HTTP headers. There's one header required for the length of the data, and a few others that are recommended. That's all there is to it.

Now we need to define what this will all look like in a real binary format.