Wikifunctions:Type proposals/bytes
Summary
Bytes is a type for an array of raw bytes.
Uses
To store content that is not a string, e.g. image, audio or video (note that external data is currently not supported in Wikifunctions). Short examples of content that is not a printable string include protobuf and ASN.1 encoded data.
- We need 33-35 bytes to store one tinyint (i.e. 0-255) in an array of Natural number (Z13518), so one persistent object can store no more than 60,000 such numbers; similarly, we need 28 bytes per item in an array of Byte (Z80), so we are limited to 74,000 bytes that way.
- Storing bytes as Base64 allows a binary file of up to 1.5 MB (1 MB if using hex and 0.5 MB if using a double-encoded string).
- Data larger than 1.5 MB cannot be stored as a persistent object and must be stored elsewhere (e.g. in Commons) and retrieved via web calls.
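The per-item figures in the list above can be checked with a few lines of Python. This is a minimal sketch that assumes objects are stored as compact JSON (no optional whitespace) and that a Byte (Z80) value is a two-character string; the variable names are illustrative only.

```python
import base64
import json


def compact_len(obj) -> int:
    """Length of obj serialized as JSON with no optional whitespace."""
    return len(json.dumps(obj, separators=(",", ":")))


# Cost of one list item, plus one character for the separating comma.
byte_item_cost = compact_len({"Z1K1": "Z80", "Z80K1": "12"}) + 1
natural_min_cost = compact_len({"Z1K1": "Z13518", "Z13518K1": "0"}) + 1
natural_max_cost = compact_len({"Z1K1": "Z13518", "Z13518K1": "255"}) + 1

# Characters needed per payload byte for the three string serializations.
sample = bytes(range(256)) * 3  # 768 bytes, a multiple of 3, so Base64 needs no padding
base64_chars = len(base64.b64encode(sample))
hex_chars = len(sample.hex())
# Counted before JSON itself escapes the backslash.
escaped_chars = len("".join(f"\\x{b:02x}" for b in sample))

print(byte_item_cost, natural_min_cost, natural_max_cost)
print(base64_chars / len(sample), hex_chars / len(sample), escaped_chars / len(sample))
```

Under these assumptions the item costs come out as 28 for Z80 and 33 to 35 for Z13518, and the three string forms cost 4/3, 2 and 4 characters per byte, which matches the 1.5 MB : 1 MB : 0.5 MB ratio quoted above.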
See also: m:Abstract_Wikipedia/Tasks#Task_P1.17:_REST_calls and m:Abstract_Wikipedia/Tasks#Task_O22:_Binary_type
Therefore we can define:
- Data shorter than 60,000 bytes is "light" data - can be stored directly as array of byte objects in JSON, though it is not performance-efficient to store it like ["Z80",{"Z1K1":"Z80","Z80K1":"12"},{"Z1K1":"Z80","Z80K1":"34"}].
- Data between 60,000 and 1,500,000 bytes is "medium" data - it currently cannot be stored directly as an array of bytes, but can be stored as Base64 or generated indirectly via function calls.
- Data longer than 1,500,000 bytes is "heavy" data - Wikifunctions usually cannot represent or handle it.
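The three size classes can be expressed as a small helper. This is only a sketch: the function and constant names are hypothetical, and the thresholds are the approximate figures from the list above, not enforced limits.

```python
LIGHT_LIMIT = 60_000       # below this, a list of Byte objects still fits
MEDIUM_LIMIT = 1_500_000   # up to this, Base64 still fits in one object


def classify_size(n_bytes: int) -> str:
    """Return 'light', 'medium' or 'heavy' for a payload of n_bytes."""
    if n_bytes < 0:
        raise ValueError("size cannot be negative")
    if n_bytes < LIGHT_LIMIT:
        return "light"
    if n_bytes <= MEDIUM_LIMIT:
        return "medium"
    return "heavy"


for size in (8, 60_000, 1_500_001):
    print(size, classify_size(size))
```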
Structure
JSON does not support strings with non-UTF-8 data, so we need to (1) double-encode it (e.g. '\\xd0\\xcf\\x11\\xe0\\xa1\\xb1\\x1a\\xe1'), (2) store the data as Base64, or (3) store it as hex.
Note: this is the serialization format only. When executing a function, bytes in intermediate results should be kept in their raw form, not encoded and decoded once per (indirect) function call.
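To make the three serialization options concrete, here is a minimal Python sketch that encodes the same eight bytes each way and decodes them back. The helper names are hypothetical, and the double-escaped form here escapes every byte as \xNN, which is a simplification chosen for the example.

```python
import base64
import json
import re

SAMPLE = bytes.fromhex("d0cf11e0a1b11ae1")


def to_escaped(data: bytes) -> str:
    """Write every byte as a literal backslash, 'x' and two hex digits."""
    return "".join(f"\\x{b:02x}" for b in data)


def from_escaped(text: str) -> bytes:
    """Inverse of to_escaped; rejects anything that is not a run of \\xNN."""
    if re.fullmatch(r"(?:\\x[0-9a-f]{2})*", text) is None:
        raise ValueError("not a sequence of \\xNN escapes")
    return bytes.fromhex(text.replace("\\x", ""))


escaped = to_escaped(SAMPLE)
as_base64 = base64.b64encode(SAMPLE).decode("ascii")
as_hex = SAMPLE.hex()

# JSON adds the second backslash when the escaped string is serialized.
print(json.dumps({"Z1K1": "Zxyz", "ZxyzK1": escaped}))
print(as_base64)
print(as_hex)

# All three forms decode back to the same raw bytes.
assert from_escaped(escaped) == SAMPLE
assert base64.b64decode(as_base64) == SAMPLE
assert bytes.fromhex(as_hex) == SAMPLE
```

Whichever form is chosen for storage, code inside a function would work with the decoded value (SAMPLE here), in line with the note above.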
We could also represent it as a typed list of bytes, but (1) this does not provide a proper interface to input or output the data; (2) this is not how bytes are implemented in programming languages.
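For comparison, converting between the typed-list form and a native byte string could look like the sketch below. It assumes that each Byte (Z80) item carries its value as a two-digit hex string in Z80K1; the function names are hypothetical.

```python
def typed_list_to_bytes(zlist: list) -> bytes:
    """Convert ["Z80", {...}, {...}] into a Python bytes object."""
    if not zlist or zlist[0] != "Z80":
        raise ValueError("expected a typed list whose first element is 'Z80'")
    # Assumption: Z80K1 holds two hex digits.
    return bytes(int(item["Z80K1"], 16) for item in zlist[1:])


def bytes_to_typed_list(data: bytes) -> list:
    """Inverse conversion, producing one object per byte."""
    return ["Z80"] + [{"Z1K1": "Z80", "Z80K1": f"{b:02x}"} for b in data]


zlist = ["Z80", {"Z1K1": "Z80", "Z80K1": "12"}, {"Z1K1": "Z80", "Z80K1": "34"}]
raw = typed_list_to_bytes(zlist)
print(raw, bytes_to_typed_list(raw) == zlist)
```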
Example values
(double escaped example)
{
  "type": "bytes",
  "value": "\\xd0\\xcf\\x11\\xe0\\xa1\\xb1\\x1a\\xe1"
}
{
  "Z1K1": "Zxyz",
  "ZxyzK1": "\\xd0\\xcf\\x11\\xe0\\xa1\\xb1\\x1a\\xe1"
}
Validator
The validator ensures that:
- (double-escape) there are no overescaped characters, and no nonprintable characters
- (base64) the base64 is valid
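A validator along these lines could be sketched in Python as follows. The exact rules are an assumption: here "over-escaped" is taken to mean a printable ASCII byte written as \xNN, a literal backslash must be written as two backslashes, and hex digits must be lowercase. The function names are hypothetical.

```python
import base64
import binascii
import re

# One token: a \xNN escape, an escaped backslash, or a printable ASCII
# character other than the backslash itself.
_TOKEN = re.compile(r"\\x([0-9a-f]{2})|\\\\|([\x20-\x5b\x5d-\x7e])")


def decode_escaped(text: str) -> bytes:
    """Decode the double-escape form, raising ValueError if it is invalid."""
    out = bytearray()
    pos = 0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"invalid character or escape at offset {pos}")
        if match.group(1) is not None:
            value = int(match.group(1), 16)
            if 0x20 <= value <= 0x7E:
                raise ValueError(f"over-escaped printable byte at offset {pos}")
            out.append(value)
        elif match.group(2) is not None:
            out.append(ord(match.group(2)))
        else:
            out.append(0x5C)  # the token was an escaped backslash
        pos = match.end()
    return bytes(out)


def is_valid_escaped(text: str) -> bool:
    try:
        decode_escaped(text)
    except ValueError:
        return False
    return True


def is_valid_base64(text: str) -> bool:
    """Strict check: standard alphabet only, with correct padding."""
    try:
        base64.b64decode(text, validate=True)
    except (binascii.Error, ValueError):
        return False
    return True
```

With these rules, every byte string has exactly one valid double-escaped spelling, which keeps stored values comparable as text.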
Identity
Bytes can be compared in the normal way.
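One detail worth illustrating is that comparison should happen on the decoded bytes rather than on the serialized text, since two different spellings can denote the same value. A small Python sketch, with a hypothetical helper name:

```python
def same_bytes(a_hex: str, b_hex: str) -> bool:
    """Compare two hex spellings by the bytes they decode to."""
    return bytes.fromhex(a_hex) == bytes.fromhex(b_hex)


upper = "D0CF11E0"
spaced = "d0 cf 11 e0"

print(upper == spaced)            # the strings differ
print(same_bytes(upper, spaced))  # the decoded bytes do not

# Ordering is lexicographic, byte by byte.
print(b"\x01\xff" < b"\x02")
```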
Converting to code
Python
Python has a built-in bytes type.
JavaScript
JavaScript has a built-in ArrayBuffer type, usually read and written through a Uint8Array view.
Renderer
Either we render it as hex (e.g. d0 cf 11 e0 a1 b1 1a e1), or use Python-style byte escaping (e.g. b'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1').
Parsers
Similar to renderer
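Both renderings, and the matching parsers, map onto standard Python facilities. The sketch below uses hypothetical function names; bytes.hex with a separator argument needs Python 3.8 or later.

```python
import ast

SAMPLE = bytes.fromhex("d0cf11e0a1b11ae1")


def render_hex(data: bytes) -> str:
    """Space-separated lowercase hex."""
    return data.hex(" ")


def parse_hex(text: str) -> bytes:
    """bytes.fromhex ignores whitespace between byte pairs."""
    return bytes.fromhex(text)


def render_literal(data: bytes) -> str:
    """Python-style bytes literal."""
    return repr(data)


def parse_literal(text: str) -> bytes:
    """Parse a Python-style bytes literal without evaluating arbitrary code."""
    value = ast.literal_eval(text)
    if not isinstance(value, bytes):
        raise ValueError("not a bytes literal")
    return value


print(render_hex(SAMPLE))
print(render_literal(SAMPLE))
```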
Alternatives
…
Comments
- We already have the Code point (Z86) type (though we've not worked on it at all, so it's likely broken in a few ways and its UX is not great). Would it be better to get that pre-defined type updated instead? Jdforrester (WMF) (talk) 17:12, 18 March 2024 (UTC)
- @Jdforrester (WMF): What I propose is a list of octets/Byte (Z80) (potentially invalid in UTF-8), not a list of Unicode characters. For example AES key is an array of octets in specific length, NOT a list of Unicode characters. Note Python and JavaScript provide different types for string and bytes.--GZWDer (talk) 21:25, 18 March 2024 (UTC)
- @GZWDer: OK, we also have Z80 too; this would just be Z881(Z80) then? Jdforrester (WMF) (talk) 16:52, 19 March 2024 (UTC)
- @Jdforrester (WMF): (1) the interface to input and output Z881(Z80) is ugly; (2) we need 28 bytes to store each byte in a Z881(Z80), and 12-15 bytes for Python pickle dump for it. It is not an efficient way to store (and use) binary data.--GZWDer (talk) 11:38, 20 March 2024 (UTC)
- Yes, I agree that the current interface isn't lovely, but it's also something on the (long) list to fix.
- I don't know what you mean by "Python pickle dump", but the disc / network transit size of the attached packets is not a significant concern unless people are trying to abuse the system for media manipulation/etc., which is not an intended use (at least, not for now). Jdforrester (WMF) (talk) 15:47, 22 March 2024 (UTC)
- 'Data shorter than 60,000 bytes is "light" data - can be stored directly as array of byte objects in JSON'
- Please, do not suggest this. Data should exclusively be stored on Wikidata, Commons, and other venues for Wikimedia movement content. Wikifunctions is for the processing of content, not storage of it. Jdforrester (WMF) (talk) 10:26, 26 March 2024 (UTC)