Wikifunctions:Type proposals/Wikidata based types

From Wikifunctions

Doing... – These will be built into the system, but prototype versions have been made on-wiki: [Do not use!] Wikidata property (Z17800), [Do not use!] Wikidata item (Z17809), [Do not use!] Wikidata lexeme form (Z17810), [Do not use!] Wikidata lexeme (Z17811), [Do not use!] Wikidata statement (Z17808)

Summary

This page describes proposed Wikifunctions types for Lexemes, Lexeme forms, Wikidata items, Wikidata statements, and Wikidata properties. They are modeled closely after the structure of the corresponding types in Wikidata, and Wikifunctions' content for instances of these types will be drawn from Wikidata. The initial motivation for creating these types is to have access to lexicographic content provided by Wikidata. For an overview of the Lexicographic data model in Wikidata, see the Lexicographical data documentation on Wikidata.

Each of the five proposed types has its own top-level section, following the Uses section. After the type descriptions, there are additional sections covering discussion topics, and a section for comments.

Type references in this page: Types are in general referenced using the form "Zx/Label"; the new types proposed here are shown as "Zlll/Lexeme", "Zfff/Lexeme form", "Ziii/Wikidata item", "Zsss/Wikidata statement", and "Zppp/Wikidata property", since their final Z IDs are not determined yet. Other types that don't yet exist (e.g., Lexeme sense) are shown using "Z0". For general information regarding Wikifunctions' representational model and its terminology, please see Wikifunctions:Function_model.

Uses

The proposed Lexeme, Lexeme form, Wikidata item, Wikidata statement, and Wikidata property types are needed to represent linguistic knowledge that is available on Wikidata. This knowledge will be used by a wide variety of Natural Language Generation (NLG) functions, including functions that will be used for Abstract Wikipedia. (Other uses of the more general types -- Wikidata item, Wikidata statement, and Wikidata property -- will likely arise in future.)

The initial uses of these types will be as input and output types of linguistic knowledge-access functions such as the following (which will serve as building blocks for other NLG functions). These are suggestive examples; this is not a comprehensive list and not part of the type proposal per se.

get Lexeme Forms from Lexeme
Input: Lexeme
Output: Typed list( Lexeme form )
get text from Lexeme Form
Input: Lexeme form
Output: Multilingual text
get grammatical features of Lexeme Form
Input: Lexeme form
Output: Typed list( Wikidata item )
get labels from Wikidata Item
Input: Wikidata item
Output: Multilingual text
get Form from Lexeme by grammatical features
Input 1: Lexeme
Input 2: Typed list( Wikidata item )
Output: Typed list( Lexeme form )
get text from Lexeme by grammatical features
Input 1: Lexeme
Input 2: Typed list( Wikidata item )
Output: Multilingual text
get plural from English Lexeme
Input: Lexeme
Output: Monolingual text
get grammatical gender of Lexeme
Input: Lexeme
Output: Wikidata Item
get Item value for property from Lexeme
Input 1: Lexeme
Input 2: Wikidata property
Output: Wikidata item
get statements for property from Lexeme
Input 1: Lexeme
Input 2: Wikidata property
Output: Typed list( Wikidata statement )
get Item value from statement
Input: Wikidata statement
Output: Wikidata item
get gender of German noun
Input: Lexeme
Output: German grammatical gender
choose correct German adjective Form for a noun
Input 1: Lexeme (Adjective)
Input 2: Lexeme (Noun)
Output: Lexeme form
German undetermined noun phrase from a noun and adjective
Input 1: Lexeme (Adjective)
Input 2: Lexeme (Noun)
Output: Monolingual text
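
The first few functions in this list could be sketched in Python roughly as follows. This is only an illustration, assuming a Lexeme reaches the code as a plain dictionary that follows the canonical ZObject structure used in the examples on this page. The keys (ZlllK7, ZfffK3, ZfffK4) use the placeholder ZIDs from this proposal, and all helper names are hypothetical.

```python
# Sketch only. Zlll, Zfff and Ziii are the placeholder ZIDs used on this page;
# the helper names are hypothetical, not existing Wikifunctions functions.

def items(typed_list):
    """Elements of a canonical typed list, skipping the leading type entry."""
    return typed_list[1:]


def get_forms_from_lexeme(lexeme):
    """'get Lexeme Forms from Lexeme': the forms stored under key K7."""
    return items(lexeme["ZlllK7"])


def get_forms_by_grammatical_features(lexeme, features):
    """Forms that carry every grammatical feature in the given typed list."""
    wanted = set(items(features))
    return [
        form
        for form in get_forms_from_lexeme(lexeme)
        if wanted <= set(items(form["ZfffK4"]))
    ]


def get_text_from_form(form):
    """'get text from Lexeme Form': the Multilingual text under key K3."""
    return form["ZfffK3"]


def make_form(form_id, text, feature_qid):
    """Build an abbreviated English Lexeme form for the demo below."""
    return {
        "Z1K1": "Zfff",
        "ZfffK1": form_id,
        "ZfffK2": "L3345",
        "ZfffK3": {
            "Z1K1": "Z12",
            "Z12K1": ["Z11", {"Z1K1": "Z11", "Z11K1": "Z1002", "Z11K2": text}],
        },
        "ZfffK4": ["Ziii", feature_qid],
    }


# Abbreviated version of the Lexeme "word" (L3345).
WORD = {
    "Z1K1": "Zlll",
    "ZlllK1": "L3345",
    "ZlllK7": [
        "Zfff",
        make_form("L3345F1", "word", "Q110786"),   # singular
        make_form("L3345F2", "words", "Q146786"),  # plural
    ],
}

plural_forms = get_forms_by_grammatical_features(WORD, ["Ziii", "Q146786"])
plural_texts = [
    mono["Z11K2"]
    for form in plural_forms
    for mono in items(get_text_from_form(form)["Z12K1"])
]
print(plural_texts)  # ['words']
```

Matching on "every requested feature" is a design assumption of this sketch; a real function might instead require an exact match of the feature set.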

Lexeme

A Lexeme represents a lexeme as described in the Wikidata lexicographic data model. It roughly represents the idea of a word or an entry in a lexicon.

Keys

A Lexeme consists of the following Keys with the given value Types. Keys that are not needed for initial work on NLG functions are tagged as either Stretch goal or Out of scope.

Keys of the Lexeme type
Key | Label | Type
K1 | identity | Zlll/Lexeme
K2 | lemmas | Z12/Multilingual text
K3 | language | Z60/Natural language
K4 | part of speech | Ziii/Wikidata item
K5 | claims | Typed list( Zsss/Wikidata statement ) (Stretch goal)
K6 | senses | Typed list( Z0/Lexeme sense ) (Out of scope)
K7 | forms | Typed list( Zfff/Lexeme form )

Example values

Value for the Lexeme word (L3345).

{
  "type": "Lexeme",
  "identity": "L3345",
  "lemmas": {
    "type": "Multilingual text",
    "texts": ["Monolingual text",
      {
        "type": "Monolingual text",
        "language": "English",
        "text": "word"
      }
    ]
  },
  "language": "English",
  "part of speech": "noun",
  "claims": ["Wikidata statement"],
  "forms": ["Lexeme form",
    {
      "type": "Lexeme form",
      "identity": "L3345F1",
      "lexeme": "L3345",
      "representations": {
        "type": "Multilingual text",
        "texts": ["Monolingual text",
          {
            "type": "Monolingual text",
            "language": "English",
            "text": "word"
          }
        ]
      },
      "grammatical features": ["Wikidata item",
        "singular"
      ],
      "claims": ["Wikidata statement"]
    },
    {
      "type": "Lexeme form",
      "identity": "L3345F2",
      "lexeme": "L3345",
      "representations": {
        "type": "Multilingual text",
        "texts": ["Monolingual text",
          {
            "type": "Monolingual text",
            "language": "English",
            "text": "words"
          }
        ]
      },
      "grammatical features": ["Wikidata item",
        "plural"
      ],
      "claims": ["Wikidata statement"]
    }
  ]
}

The same value in canonical ZObject form:

{
  "Z1K1": "Zlll",
  "ZlllK1": "L3345",
  "ZlllK2": {
    "Z1K1": "Z12",
    "Z12K1": ["Z11",
      {
        "Z1K1": "Z11",
        "Z11K1": "Z1002",
        "Z11K2": "word"
      }
    ]
  },
  "ZlllK3": "Z1002",
  "ZlllK4": "Q1084",
  "ZlllK5": ["Zsss"],
  "ZlllK7": ["Zfff",
    {
      "Z1K1": "Zfff",
      "ZfffK1": "L3345F1",
      "ZfffK2": "L3345",
      "ZfffK3": {
        "Z1K1": "Z12",
        "Z12K1": ["Z11",
          {
            "Z1K1": "Z11",
            "Z11K1": "Z1002",
            "Z11K2": "word"
          }
        ]
      },
      "ZfffK4": ["Ziii",
        "Q110786"
      ],
      "ZfffK5": ["Zsss"]
    },
    {
      "Z1K1": "Zfff",
      "ZfffK1": "L3345F2",
      "ZfffK2": "L3345",
      "ZfffK3": {
        "Z1K1": "Z12",
        "Z12K1": ["Z11",
          {
            "Z1K1": "Z11",
            "Z11K1": "Z1002",
            "Z11K2": "words"
          }
        ]
      },
      "ZfffK4": ["Ziii",
        "Q146786"
      ],
      "ZfffK5": ["Zsss"]
    }
  ]
}

Validator

Initially, the validator doesn't do anything. As we improve our understanding of how Lexemes are used, the validator could ensure that:

  • the languages in the lemmas field fit the language field
  • there are lemmas
  • the part of speech is from a correct set of parts of speech for the given language
  • the Forms point back to the Lexeme
  • the Forms are in languages that fit the language field
  • the right Forms are available
  • the Forms have the grammatical features expected for the given part of speech and language
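
Two of these possible checks are simple enough to sketch in Python. The sketch below is not a proposed implementation: it assumes the Lexeme is a dictionary in canonical ZObject form with the placeholder keys from this page, and the function name validate_lexeme is hypothetical. The language-related checks are left out because they would need a definition of when a language variant "fits" the Lexeme's language.

```python
# Sketch only: covers "there are lemmas" and "the Forms point back to the
# Lexeme". validate_lexeme is a hypothetical name.

def validate_lexeme(lexeme):
    """Return a list of problems; an empty list means no problem was found."""
    problems = []
    lexeme_id = lexeme.get("ZlllK1")

    # Typed lists start with their element type, so real entries begin at index 1.
    lemmas = lexeme.get("ZlllK2", {}).get("Z12K1", ["Z11"])[1:]
    if not lemmas:
        problems.append("no lemmas")

    for form in lexeme.get("ZlllK7", ["Zfff"])[1:]:
        if form.get("ZfffK2") != lexeme_id:
            problems.append(
                f"form {form.get('ZfffK1')} does not point back to {lexeme_id}"
            )
    return problems


GOOD = {
    "Z1K1": "Zlll",
    "ZlllK1": "L3345",
    "ZlllK2": {
        "Z1K1": "Z12",
        "Z12K1": ["Z11", {"Z1K1": "Z11", "Z11K1": "Z1002", "Z11K2": "word"}],
    },
    "ZlllK7": ["Zfff", {"Z1K1": "Zfff", "ZfffK1": "L3345F1", "ZfffK2": "L3345"}],
}

# Deliberately broken: no lemma, and a form that belongs to another Lexeme.
BAD = {
    "Z1K1": "Zlll",
    "ZlllK1": "L3345",
    "ZlllK2": {"Z1K1": "Z12", "Z12K1": ["Z11"]},
    "ZlllK7": ["Zfff", {"Z1K1": "Zfff", "ZfffK1": "L1347F2", "ZfffK2": "L1347"}],
}

print(validate_lexeme(GOOD))
print(validate_lexeme(BAD))
```

Returning a list of problems rather than raising on the first one is a choice made for this sketch, so that a caller can report everything at once.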

Identity

Two Lexemes are the same if they have the same value for identity.

Converting to code

Python

A Python dictionary that follows the structure of the ZObject.

JavaScript

A JavaScript object that follows the structure of the ZObject.

Renderer

Initially, we don't have a bespoke renderer. We plan to add one later when we understand better how the Type works.

Parsers

Initially, we don't create a bespoke parser. We plan to add one later when we understand better how the Type works.

Lexeme form

A Lexeme form represents a form as described in the Wikidata lexicographic data model. It roughly represents the idea of a word that is adapted to its grammatical role, e.g. the verb used for the third person present in English, or the noun in plural when needed.

Keys

A Lexeme form consists of the following Keys with the given value Types. Keys that are not needed for initial work on NLG functions are tagged as Stretch goal.

Keys of the Lexeme form type
Key | Label | Type
K1 | identity | Zfff/Lexeme form
K2 | lexeme | Zlll/Lexeme
K3 | representations | Z12/Multilingual text
K4 | grammatical features | Typed list( Ziii/Wikidata item )
K5 | claims | Typed list( Zsss/Wikidata statement ) (Stretch goal)

Example values

Value for the plural form "colours/colors":

{
  "type": "Lexeme form",
  "identity": "L1347F2",
  "lexeme": "L1347",
  "representations": {
    "type": "Multilingual text",
    "texts": ["Monolingual text",
      {
        "type": "Monolingual text",
        "language": "British English",
        "text": "colours"
      },
      {
        "type": "Monolingual text",
        "language": "Canadian English",
        "text": "colours"
      },
      {
        "type": "Monolingual text",
        "language": "American English",
        "text": "colors"
      }
    ]
  },
  "grammatical features": ["Wikidata item",
    "plural"
  ],
  "claims": ["Wikidata statement"]
}

The same value in canonical ZObject form:

{
  "Z1K1": "Zfff",
  "ZfffK1": "L1347F2",
  "ZfffK2": "L1347",
  "ZfffK3": {
    "Z1K1": "Z12",
    "Z12K1": ["Z11",
      {
        "Z1K1": "Z11",
        "Z11K1": "Z1199",
        "Z11K2": "colours"
      },
      {
        "Z1K1": "Z11",
        "Z11K1": "Z1437",
        "Z11K2": "colours"
      },
      {
        "Z1K1": "Z11",
        "Z11K1": "Z1689",
        "Z11K2": "colors"
      }
    ]
  },
  "ZfffK4": ["Ziii",
    "Q146786"
  ],
  "ZfffK5": ["Zsss"]
}
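
Because a form can carry several representations, a function that produces text has to pick one for the language variant it is generating. A minimal Python sketch of that lookup follows, assuming the form is a dictionary in the canonical structure above; the function name representation_for is hypothetical, and the language ZIDs are simply the ones that appear in the example.

```python
# Sketch only: representation_for is a hypothetical helper, and the data is an
# abbreviated copy of the "colours/colors" example (claims omitted).

def representation_for(form, language_zid):
    """Return the representation text for one language, or None if absent."""
    for mono in form["ZfffK3"]["Z12K1"][1:]:  # skip the leading "Z11" type entry
        if mono["Z11K1"] == language_zid:
            return mono["Z11K2"]
    return None


COLOURS_PLURAL = {
    "Z1K1": "Zfff",
    "ZfffK1": "L1347F2",
    "ZfffK2": "L1347",
    "ZfffK3": {
        "Z1K1": "Z12",
        "Z12K1": [
            "Z11",
            {"Z1K1": "Z11", "Z11K1": "Z1199", "Z11K2": "colours"},
            {"Z1K1": "Z11", "Z11K1": "Z1437", "Z11K2": "colours"},
            {"Z1K1": "Z11", "Z11K1": "Z1689", "Z11K2": "colors"},
        ],
    },
    "ZfffK4": ["Ziii", "Q146786"],
}

print(representation_for(COLOURS_PLURAL, "Z1689"))  # colors
print(representation_for(COLOURS_PLURAL, "Z1002"))  # None
```

The second call shows that a request for plain English (Z1002) finds nothing here, since the form only lists variants; whether and how to fall back to a related variant is left open in this sketch.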

Validator

The validator ensures that:

Identity

Two Lexeme forms are the same if their identity is the same.

Converting to code

Python

A Python dictionary that follows the structure of the ZObject.

JavaScript

A JavaScript object that follows the structure of the ZObject.

Renderer

Initially, we don't have a bespoke renderer. We plan to add one later when we understand better how the Type works.

Parsers

Initially, we don't have a bespoke parser. We plan to add one later when we understand better how the Type works.

Wikidata item

A Wikidata item represents an item as described in the Wikibase data model.

Keys

A Wikidata item consists of the following Keys with the given value Types. Keys that are not needed for initial work on NLG functions are tagged as Out of scope.

Keys of the Wikidata item type
Key | Label | Type
K1 | identity | Ziii/Wikidata item
K2 | labels | Z12/Multilingual text
K3 | aliases | Z32/Multilingual stringset (Out of scope)
K4 | descriptions | Z12/Multilingual text (Out of scope)
K5 | sitelinks | Typed list( Z0/Wikidata sitelink ) (Out of scope)
K6 | claims | Typed list( Zsss/Wikidata statement ) (Out of scope)

Example values

Value for plural (with only the keys which are in scope for now):

{
  "type": "Wikidata item",
  "identity": "Q146786",
  "labels": {
    "type": "Multilingual text",
    "texts": ["Monolingual text",
      {
        "type": "Monolingual text",
        "language": "English",
        "text": "plural"
      },
      {
        "type": "Monolingual text",
        "language": "Korean",
        "text": "복수"
      },
      {
        "type": "Monolingual text",
        "language": "Croatian",
        "text": "množina"
      }
    ]
  }
}

The same value in canonical ZObject form:

{
  "Z1K1": "Ziii",
  "ZiiiK1": "Q146786",
  "ZiiiK2": {
    "Z1K1": "Z12",
    "Z12K1": ["Z11",
      {
        "Z1K1": "Z11",
        "Z11K1": "Z1002",
        "Z11K2": "plural"
      },
      {
        "Z1K1": "Z11",
        "Z11K1": "Z1643",
        "Z11K2": "복수"
      },
      {
        "Z1K1": "Z11",
        "Z11K1": "Z1272",
        "Z11K2": "množina"
      }
    ]
  }
}

Validator

Initially, the validator doesn't do anything. As we improve our understanding of how Lexemes are used, the validator could be extended with further checks.

Identity

Two Wikidata Items are the same if they have the same value for identity.

Converting to code

Python

A Python dictionary that follows the structure of the ZObject.

JavaScript

A JavaScript object that follows the structure of the ZObject.

Renderer

Initially, we don't have a bespoke renderer. We plan to add one later when we understand better how the Type works.

Parsers

Initially, we don't create a bespoke parser. We plan to add one later when we understand better how the Type works.

Wikidata statement

A Wikidata statement represents a statement as described in the Wikibase data model. An instance of this type roughly represents a simple statement in a natural language, e.g. "Paris is the capital of France", or, more pertinently, "the French word soleil (sun) is of the masculine grammatical gender".

Note that, to begin with, we only support item values: only statements that have an item value will be represented. Initially, we will not represent no-value snaks or some-value snaks, nor qualifiers or sources.

Keys

A Wikidata statement consists of the following Keys with the given value Types. Keys that are not needed for initial work on NLG functions are tagged as either Stretch goal or Out of scope.

Keys of the Wikidata statement type
Key | Label | Type
K1 | subject | Z1/Object
K2 | predicate | Zppp/Wikidata property (Stretch goal)
K3 | value | Z1/Object
K4 | qualifiers | Typed list( Z0/Wikidata qualifier ) (Out of scope)
K5 | sources | Typed list( Z0/Wikidata source ) (Out of scope)
K6 | rank | Z0/Statement rank (Out of scope)
K7 | identity | Zsss/Wikidata statement (Out of scope)

(Out of scope) Statements with no value or an unknown value are represented by special objects.

Example values

Value for the statement that the Lexeme "Wort" has the grammatical gender neuter:

{
  "type": "Wikidata statement",
  "subject": "Wort",
  "predicate": "grammatical gender",
  "value": "neuter"
}

The same value in canonical ZObject form:

{
  "Z1K1": "Zsss",
  "ZsssK1": "L2206",
  "ZsssK2": "P5185",
  "ZsssK3": "Q1775461"
}

Validator

Initially, the validator ensures that the K3/value is a Ziii/Wikidata item. It also ensures that the subject is either a Zlll/Lexeme, a Zfff/Lexeme form, or a Ziii/Wikidata item.
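
A rough Python sketch of these two checks follows. It assumes that subject and value may arrive either as dereferenced objects (dictionaries with a Z1K1 type) or as bare Wikidata IDs, and that bare IDs can be classified by their shape; both assumptions, the ID patterns, and the names kind_of and validate_statement are illustrative only.

```python
import re

# Hypothetical classification of bare IDs by shape. The form pattern accepts
# both "L3345F1" (as written on this page) and "L3345-F1".
ID_PATTERNS = {
    "Zfff": re.compile(r"L\d+-?F\d+"),
    "Zlll": re.compile(r"L\d+"),
    "Ziii": re.compile(r"Q\d+"),
    "Zppp": re.compile(r"P\d+"),
}


def kind_of(value):
    """Return the placeholder type ZID of an object or bare ID, or None."""
    if isinstance(value, dict):
        return value.get("Z1K1")
    if isinstance(value, str):
        for zid, pattern in ID_PATTERNS.items():
            if pattern.fullmatch(value):
                return zid
    return None


def validate_statement(statement):
    """Return a list of problems; an empty list means the statement passes."""
    problems = []
    if kind_of(statement.get("ZsssK1")) not in {"Zlll", "Zfff", "Ziii"}:
        problems.append("subject must be a Lexeme, Lexeme form, or Wikidata item")
    if kind_of(statement.get("ZsssK3")) != "Ziii":
        problems.append("value must be a Wikidata item")
    return problems


WORT_GENDER = {
    "Z1K1": "Zsss",
    "ZsssK1": "L2206",
    "ZsssK2": "P5185",
    "ZsssK3": "Q1775461",
}

print(validate_statement(WORT_GENDER))
print(validate_statement({**WORT_GENDER, "ZsssK3": "P5185"}))
```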

Identity

Two statements are the same if their identity is the same. Initially, there is no identity. In that case, two statements are the same if their subject, predicate, and value are the same.
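
This identity rule can be written down as a small comparison function. The Python sketch below assumes statements are dictionaries in canonical form with the placeholder keys from this page; the name same_statement is hypothetical. The text does not say what happens when only one of the two statements has an identity, so falling back to the triple in that case is an assumption of the sketch.

```python
# Sketch only: same_statement is a hypothetical helper.

def same_statement(a, b):
    """Compare by identity (K7) when both have one, else by subject/predicate/value."""
    id_a, id_b = a.get("ZsssK7"), b.get("ZsssK7")
    if id_a is not None and id_b is not None:
        return id_a == id_b
    # Assumption: with a missing identity, fall back to the K1/K2/K3 triple.
    return all(a.get(key) == b.get(key) for key in ("ZsssK1", "ZsssK2", "ZsssK3"))


first = {"Z1K1": "Zsss", "ZsssK1": "L2206", "ZsssK2": "P5185", "ZsssK3": "Q1775461"}
second = dict(first)
other_value = {**first, "ZsssK3": "Q146786"}

print(same_statement(first, second))
print(same_statement(first, other_value))
```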

Converting to code

Python

A Python dictionary that follows the structure of the ZObject.

JavaScript

A JavaScript object that follows the structure of the ZObject.

Renderer

Initially, we don't have a bespoke renderer. We plan to add one later when we understand better how the Type works.

Parsers

Initially, we don't have a bespoke parser. We plan to add one later when we understand better how the Type works.

Wikidata property

A Wikidata property represents a property as described in the Wikibase data model. The set of properties defines the possible predicates that can be used in a statement.

Keys

A Wikidata property consists of the following Keys with the given value Types. Keys that are not needed for initial work on NLG functions are tagged as Out of scope.

Keys of the Wikidata property type
Key | Label | Type
K1 | identity | Zppp/Wikidata property
K2 | data type | Z0/Wikidata data type (Out of scope)
K3 | labels | Z12/Multilingual text (Out of scope)
K4 | claims | Typed list( Zsss/Wikidata statement ) (Out of scope)

(Out of scope) Properties have a fair amount of special handling for representing constraints, formatters, etc. These are all out of scope initially.

Example values

Value for grammatical gender (with only the keys which are in scope for now):

{
  "type": "Wikidata property",
  "identity": "grammatical gender"
}

The same value in canonical ZObject form:

{
  "Z1K1": "Zppp",
  "ZpppK1": "P5185"
}

Validator

Initially, the validator doesn't do anything. As we improve our understanding of how Lexemes are used, the validator could be extended with further checks.

Identity

Two Wikidata properties are the same if they have the same value for identity.

Converting to code

Python

A Python dictionary that follows the structure of the ZObject.

JavaScript

A JavaScript object that follows the structure of the ZObject.

Renderer

Initially, we don't have a bespoke renderer. We plan to add one later when we understand better how the Type works.

Parsers

Initially, we don't create a bespoke parser. We plan to add one later when we understand better how the Type works.

Alternatives

  • We could follow the Wikidata Lexicographic model less tightly
  • We could have bespoke Types for each language

Notes and Questions

Transparent handling of IDs and Literals

Throughout this proposal, we assume that we magically handle QIDs, LIDs, and FIDs just like ZIDs, i.e. like references to Objects.

Alternatively, we could have made the “dereference a QID” function explicit, and have both Items and QIDs as separate types.
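
To make the explicit alternative concrete, here is a minimal Python sketch in which a QID is just a string and an Item is a separate object obtained through a dereferencing call. Everything in it is hypothetical: the names dereference_qid and label_in, and the in-memory FAKE_WIKIDATA table that stands in for a real lookup.

```python
# Sketch only. FAKE_WIKIDATA stands in for Wikidata; a real implementation
# would fetch the entity instead of reading a local table.
FAKE_WIKIDATA = {
    "Q146786": {
        "Z1K1": "Ziii",
        "ZiiiK1": "Q146786",
        "ZiiiK2": {
            "Z1K1": "Z12",
            "Z12K1": ["Z11", {"Z1K1": "Z11", "Z11K1": "Z1002", "Z11K2": "plural"}],
        },
    },
}


def dereference_qid(qid, store=FAKE_WIKIDATA):
    """Explicit dereferencing: turn a QID (a plain string) into an Item object."""
    if qid not in store:
        raise LookupError(f"unknown QID: {qid}")
    return store[qid]


def label_in(item, language_zid):
    """Expects an Item object, so callers holding a QID must dereference first."""
    for mono in item["ZiiiK2"]["Z12K1"][1:]:
        if mono["Z11K1"] == language_zid:
            return mono["Z11K2"]
    return None


plural_label = label_in(dereference_qid("Q146786"), "Z1002")
print(plural_label)  # plural
```

Under the transparent handling assumed by the proposal, the call to dereference_qid would disappear and "Q146786" could be passed wherever an Item is expected.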

What about literal Lexemes and Lexeme Forms?

Can we write a literal inline? We don't know. Because a Lexeme has as a key the Lexeme ID, technically the answer is probably no (just as we can't write Types inline). But it would be helpful for test cases if nothing else.

We are thinking "no support for inline Lexemes", but are not sure.

Items for Languages or Objects of Type Natural Language

The proposal assumes that we transparently translate both Wikidata QIDs for languages and IETF language codes to the appropriate Object of Type Z60/Natural language.

Instead we could also have used the new Item type for Wikidata QIDs for a language and String for the language code, and have Functions providing the mapping.
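
Under that alternative, the mapping functions might look something like the following Python sketch. The function names and the tiny lookup tables are hypothetical; they contain only English (Wikidata item Q1860, language code "en", Natural language Z1002) as an illustration.

```python
# Sketch only: hypothetical mapping functions with one illustrative entry each.
QID_TO_NATURAL_LANGUAGE = {"Q1860": "Z1002"}  # English
CODE_TO_NATURAL_LANGUAGE = {"en": "Z1002"}    # English


def natural_language_from_qid(qid):
    """Map a Wikidata QID for a language to a Z60 reference, or None if unknown."""
    return QID_TO_NATURAL_LANGUAGE.get(qid)


def natural_language_from_code(code):
    """Map an IETF language code to a Z60 reference, or None if unknown."""
    # Language tags are case-insensitive, so normalise before the lookup.
    return CODE_TO_NATURAL_LANGUAGE.get(code.lower())


print(natural_language_from_qid("Q1860"))
print(natural_language_from_code("EN"))
```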

Consecutive Key IDs

The proposal aims to predict a logical order for the complete Type. This will lead to the Keys initially having gaps (e.g. K6 on Lexeme would initially be missing).

Instead we could provide the Keys in a consecutive order. That will later lead to inconsistencies in the order between Wikifunctions and Wikidata, though.

Functions only usable in compositions

Every Function that uses one of the new dereferencing functions would only be available in compositions, which may cause trouble given our current orchestration and evaluation performance.

(Wait, is this true? If we convert Lexemes, Forms, and Items into literals, why wouldn’t it be usable in code implementation? Even without reentrance? So if we get a Lexeme, we could just have objects representing that Lexeme. That could take us somewhere, even without reentrance?)

Comments

Discussion

  • Lexeme According to the Wikidata model, a Lexeme has a single Lemma and a single language, which implies ZlllK2 should be singular and Monolingual text (Z11). Arguably, if it’s monolingual, we don’t also need Natural language (Z60) or, if we have Z60, lemma can just be a String (Z6) or, indeed, its ZfffK1 (which allows for orthographic variation without departing far from the Wikidata model).
It’s not yet clear to me how we determine the required Lexeme in the first place, but monolingual lemma to Lexeme list would be a start. In English, “word” can be a noun or a verb but Wikidata insists on a single lexical category per lexeme (ZlllK4), so a lexeme list per K4 or a list of K4s for a lemma would seem to be necessary. (That said, a lemma is just one of the lexeme’s forms, so we could just go from literal form to lexeme list(s).) --GrounderUK (talk) 11:22, 11 July 2024 (UTC)
Just looking at https://www.wikidata.org/wiki/Lexeme:L1 shows multiple lemmas (lemmata?), one for sux-latn and one for sux-xsux, so regardless of what a model claims, the reality is that they have more than one lemma (or perhaps one should say, more than one orthography of a lemma?) per Lexeme. Jdforrester (WMF) (talk) 12:00, 11 July 2024 (UTC)
@GrounderUK and Jdforrester (WMF): all lexemes have one and only one language, but indeed some (60k) of them have more than one lemma (inside the same language lato sensu, but variation is important as we don't want to generate sentences that randomly mix different languages stricto sensu: « organise an organization » would be weird); in languages with multiple writing systems, it's almost all of them. And yes, “word” is two different lexemes with very different sets of forms: d:L:L3345 (a noun, with two forms: a singular and a plural) and d:L:L17039 (a verb, with 5 forms: present, past, etc.). In some extreme (and thankfully rare) cases, there are even two lexemes with the same triple of lemma/language/category (the verbs “ressortir” in French: d:L17373 and d:L691143). We need to take that into account for accessing the right lexeme. Cheers, VIGNERON (talk) 13:14, 11 July 2024 (UTC)
But do you know of any case where the lemma is (or could usefully be) more than one of the lexeme’s forms? GrounderUK (talk) 13:32, 11 July 2024 (UTC)
@GrounderUK: if I understand your question correctly, it's quite common to have several representations of a form identical to the lemma (verbs in Romance languages or nouns in languages with declension come to my mind right now as obvious examples). Cheers, VIGNERON (talk) 14:32, 11 July 2024 (UTC)
The lemma shouldn't be for different forms, just for different representations of the lemma form. -- DVrandecic (WMF) (talk) 00:49, 16 July 2024 (UTC)
If I have not misunderstood User:VIGNERON, he prefers different Breton representations of the same grammatical form to be different Lexeme forms. However, it seems to me that the Wikifunctions type should be capable of supporting all the different approaches that are (or might be) supported by Wikidata, as well as some variants that are not. It will then be possible to write functions that transform what is present in Wikidata into a new Lexeme object that is consistent with an alternative approach. For example, color/colour should not have three representations for its plural form since there are only two distinct representations and more than three English variants. Moreover, there are no irregular forms, so it might make more sense, in some context, to replace the recorded forms with a reference to the rule that is followed (which is ultimately a function). I’m guessing that Grammatical feature could distinguish between regular and irregular forms and Wikidata statement could reference the relevant function and (separately) the base Form whose representation(s) follow the rule (if this is not the lemma). It would be convenient if we could distinguish between those values that are represented on Wikidata and those that are not. GrounderUK (talk) 10:51, 16 July 2024 (UTC)
That's a really good consideration. I was thinking of the Lexeme type to be a pretty much carbon-copy of the data in Wikidata, which is one reason why it is so tightly following the Wikidata data model. If we want to extend that, for example to keep track of whether a form is generated by a Function or whether it is given in Wikidata, that would be happening entirely on Wikifunctions' side. I.e. we would have another Type that represents that. Particularly because Lexemes have an identifier, it would be good to keep the Lexemes the same as they are in Wikidata.
Or, to put it differently: one way could be to eventually have "English noun" as a Type. English noun can be constructed from a Lexeme, and then it would try to find all relevant forms and fit them in the right place. If it doesn't have certain forms, it might decide to add them through a function. "English noun" could also be constructed from a string (or two) and in that case it would also use the functions. But at this point, the English noun value would be one removed from the Lexeme. But it would be much easier for us to work with in the context of generating texts for English than a raw Lexeme is. It could also keep the information whether the forms are generated or whether they are given.
Does this make sense? -- DVrandecic (WMF) (talk) 19:10, 16 July 2024 (UTC)
It does make sense, yes. But then I would envisage non-Wikidata Lexeme types that would parallel the Wikidata Lexeme types, so that a raw Wikidata Lexeme could be transformed into a substitute that would (generally) be treated by functions as if it were a Wikidata lexeme. I imagine that would be the best way to handle alternative approaches adopted in Wikidata (and inconsistencies). GrounderUK (talk) 19:42, 16 July 2024 (UTC)
Maybe. I think Wikidata Lexemes are potentially a bit heavyweight and generic, and maybe more focused Types could be easier to handle. But I think both approaches would be valid. -- DVrandecic (WMF) (talk) 23:38, 16 July 2024 (UTC)
Yeah, there the singular lemma is the singular L1-F1 with multiple representations. I don’t know whether the lemma is always F1 but it should (by definition) be exactly one of the forms. GrounderUK (talk) 13:20, 11 July 2024 (UTC)
Maybe. I can imagine that in some languages the lemma is actually not any of the forms, but a more normalized version that appears in the lexicon but not in the language. But it could also be that the lemma is always one of the forms (that, I think, would be the case for the languages I speak, as far as I can tell). -- DVrandecic (WMF) (talk) 00:50, 16 July 2024 (UTC)
Getting lexeme using reverse properties from items
One interesting feature of Wikidata items and lexemes is properties that link senses to items like Template:P'. I think they could be really interesting for Abstract wiki, because they could allow the passing of one item parameter as, say, an Abstract language descriptor, and from there getting the relevant lexemes that can express this idea. I'd think it could be interesting to, say, have a descriptor for a person having a feeling like anger (Q79871), and take a language form "familiar language" or "common language" and find the ways to say someone is angry in familiar language. I noted it is not really possible with the current proposal because the functions all go in "forward" mode; there is no property like get senses with a statement with item value (sense_property, item_value, language) or something like that which could return a list of lexemes or senses. Could anything like that be incorporated? TomT0m (talk) 13:09, 10 August 2024 (UTC)
Additional textual data on properties
Properties have aliases and descriptions, however they are not included in the type as "out of scope" keys. -- ScienceD90 (talk) 19:51, 23 August 2024 (UTC)