Hi again!
Some things have happened since the last (and the
first) time I wrote here. Probably the main one is that, now, I do have a path
to follow, a clearer objective to reach:
- To produce an implementation of an algorithm to verify if it is possible to determine the type of every node in a graph according to a ShEx Schemma.
- Performance testing over the obtained code.
- Establishing comparisons between my results and other people´s work over ShEx through tests.
And that’s all, for the moment. Pretty simple, isn’t
it? Helly no ;) The previous list of
objectives can suggest a lot of unresolved questions. The main ones, summary of
all of them in my opinion, are:
- What is ShEx?
- What is your plan to do such a thing like that?
I prefer to not give you a big answer for the first one. For sure, we will be talking about it in other post. Let´s just say that it is a set of expressions handy for describing how nodes
in a graph should be like in order to determine their type. Example: if you are
considering a graph of cities and, for you, a city must necessarily have a
name, a town hall, two parks and be connected with at least other two cities by
road, there is a way to indicate it using something called Shape Expressions
(ShEx). For those of you who are familiar with regular expressions, probably
ShEx syntax won´t be something shocking. If your curiosity about this topic has
not been satisfied with this brief explanation, you can get more into ShEx in
the same way I did: looking for info in papers and google J. Just
a piece of advice: look for “shape expressions”, not for “shex”. You will find
that “shex” results are not linked with Shape Expressions…
For the second question, I think there is not a simple
answer. But there are a bundle of requirements that I have to fullfil to be
able to reach my goals. Knowledge that I have to improve or acquire:
-
ShEx. Theoretical and syntactical aspects
and work that has already been done around it.
-
Graph theory.
-
Set theory.
-
Latex. If we want to produce a paper with
this, we will have to use “heavy” math. This kind of formulas that you read
thinking “oh god… Where are the numbers? What are these symbols? Is this math o
abstract art?” The common text editors do not give you an easy way to manage
that kind of symbols.
-
Scala language. My director and I have
decided that Scala can be a good option to face this work.
On this stage of the journey, I am just trying to look
for information, understanding it, and exploring tools and technologies (Scala,
Latex). Those are bumps on the road that you necessary have to manage, but it
is not productive work.
However, there are some logical and mathematical
concepts that I have went through that I have tried to “translate” to simpler
words (natural language or easy examples), that can be short cuts to
understand or remember some ideas.
I’ll put it here, to share it with you and for me, for
my future consults.
Graph theory
- Outbound neighborhood of a node “n” in a graph, also outg(n): set ot nodes you can reach from “n” using a single edge to move. It is necessary to use an edge. That means that if “n” does not have an edge pointing itself, “n” is not in outg(n).
- Labeled outbound neighborhood of a node “n” in a graph, also out-lab(n): the idea is the same of the previous case, but out-lab(n) is not a set of nodes, but a set of pairs {label, node}. All the nodes present in outg(n) will be present in lab-outg(n), forming a pair with the label of the edge that you have to use to reach them from “n”.
Set Theory
-
Bag (over a set). Really, a bag is a
function, but we can make a first approach to the idea thinking in a bag as a
kind of set that allows repetition of elements. As you probably know, a set (a
mathematical set) is a group of unordered elements in which only matters if
some element is present or not. For example, if we have a set s = {1,2,3} and
we try to add a “3” to it, the set doesn´t change, because it already had a
“3”. If we add a “4”, then the set does change. However, a bag over “s” does
take care of the number of times that a single element appears in a set. For
example, with the bag “b” = {|1,2,2,4,4|}, we are representing a kind of set in
which we have a “1”, two “2” and two “4”.
This is the idea in which bags are based, but they are
not really rare sets, but functions that can be represented in a way similar to
sets. With “b” = {|1,2,2,4,4|}, what we are really doing is defining the next
function:
o b(1) =
1
o b(2) =
2
o b(4) =
2
o b(anyOtherNumber)
= 0
-
Bag Union. The union of bags are quiet
similar to the union of sets. Probably the best starting point is an example
o {|1,2,2|}
bagUnion {|2,3,3|} = {|1,2,2,3,3|}
We add the
elements represented in both bags regarding the number of repetitions of each
element. I have written “bagUnion” instead of a mathematical
symbol because is not easy to find the symbol used to represent that kind of
operation. It looks like a Union (U) and a “+” overlapped.
Regular Bag Expressions
We can use RBEs (Regular Bag
Expressions) to define types in a graph. Let´s see an example and work over it in
order to talk about possibilities of RBEs:
Let´s consider S0 as a schema
definition. The schema contains three expressions that looks like Tn à RegularBagExpression.
Each sentence defines a type in the schema S0, and that types are T0, T1 and
T2. Thinking in graphs, we can say “type of node” instead of only “type”, and
that is going to make things easier.
Now, let´s dig into the first sentence, T0. We could translate it in the next way: “If
we want to consider a node ‘n’ of type ‘T0’, then node ‘n’ should be like we
are describing with this RBE”. At this point we can focus only in deciphering
the meaning of the RBEs ;) We can see the next elements:
-
Letters: “a”, “b” and “c”. They refer to
labels in edges.
-
Tn: “T1” and “T2”. They refer to types, the
same as they did before.
-
Double colon “::” The link an edge label
with a type. They always appear between a letter and a Tn.
-
Bar: “|”. It means alternative between the
elements that appear in its right and its left. In natural language, it means “or”.
-
Double bar: “||” It means unordered
concatenation. It means that both elements (right and left) should appear, but
without specifying the order.
-
Parenthesis: “(…..)”. No difference with
the classic meaning os parenthesis in math and logic.
-
Asterisk: “*”. It means that the element
at its left should appear from 0 to infinite times.
With this, we are in conditions to translate
in natural language the RBEs of the three sentences:
1º: To be a T0 type, you have to
have an edge labeled that point to a T1 and an edge labeled as “b” that points
to a T2. And those connections, of course, don´t have an order, we are talking
about graphs ;)
2º: To be a T1 type, you have to
have an edge labeled “a” with a T1 OR an edge labeled “b” with a T2. And that has to be truth from 0 to infinite
times. In a simpler way, to be a T1 you can have as much connections as you
want with the form a::T1 and b::T2, being not having a single connection a
valid possibility. But ey! no other connections of other forma are allowed.
3º: To be a T2 type. You must have
a connection b::t2 OR c::t1. At least one of any type, but no more than one of
each type.
And that´s enough for this time. We
will be talking about ShEx soon, and not the math that rounds it ;)
See you in the next post!