Friday 27 February 2015

new function: such-that[]

So, for a long time I thought I needed a "set-builder" notation to complete the maths structure of BKO. But I'm now of the opinion that we don't need such a thing, and that it would actually weaken our model rather than strengthen it.

The proposed forms for BKO set-builder were:
KET in SUPERPOSITION such that TRUTH-STATEMENT
OP-SEQUENCE KET for KET in SUPERPOSITION such that TRUTH-STATEMENT
and a couple of examples of this:
"Who was the president in 1825?"
|x> in "" |early US Presidents: _list> such that <year: 1825|president-era|x> > 0.5
"What was the party of the third president?"
party |x> for |x> in "" |early US Presidents: _list> such that president-number|x> == |number: 3> 
But now we have:
such-that[op1,op2,...] |x>
returns |x> if op_k |x> is true/yes for all operators op1, op2, ...; otherwise it returns |>.
And it is linear, so if you apply it to a superposition it keeps only the kets that satisfy the conditions, with |> as the identity element keeping things tidy for the kets that don't.

Python:
def such_that(one,context,ops):
  # keep the ket only if every operator maps it to "true" or "yes"
  for op in ops.split(','):
    label = one.apply_op(context,op).the_label().lower()
    if label not in ["true","yes"]:
      return ket("",0)          # ie, the identity element |>
  return one
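For readers without the project classes at hand, here is a minimal standalone sketch of the same idea, with plain lists standing in for superpositions and predicate functions standing in for operators (all names here are hypothetical, not project code):

```python
# superpositions become lists of ket labels; operators become predicates
def such_that(ops, superposition):
    # keep a ket only if every operator maps it to "true"/"yes";
    # dropping a ket plays the role of the |> identity element
    return [x for x in superposition
            if all(op(x).lower() in ("true", "yes") for op in ops)]

# a toy truth-statement operator:
is_short = lambda x: "yes" if len(x) <= 3 else "no"

print(such_that([is_short], ["ab", "abcd", "xyz"]))  # ['ab', 'xyz']
```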
So, now let's do the two examples we mentioned above:
sa: load early-us-presidents.sw
sa: in-office-in-1825 |*> #=> is-equal[1825] president-era |_self>
sa: such-that[in-office-in-1825] "" |early US Presidents: _list>
|Monroe> + |Q Adams>

sa: is-third-president |*> #=> is-equal[3] president-number |_self>
sa: such-that[is-third-president] "" |early US Presidents: _list>
|Jefferson>

sa: party such-that[is-third-president] "" |early US Presidents: _list>
|party: Democratic-Republican>
And observe it does the same job as set-builder while preserving the OP-SEQUENCE KET structure.

Now, a quick toy example (randomly select kets from a superposition):
sa: random |*> #=> pick-elt (|yes> + |no>)
sa: such-that[random] split |a b c d e f g h i j k>
|c> + |f> + |g> + |i>

sa: such-that[random] split |a b c d e f g h i j k>
|a> + |b> + |g> + |h>
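Outside the console, the same coin-flip filtering can be sketched in a few lines of plain Python (again, just a stand-in for the BKO machinery):

```python
import random

# random |*> #=> pick-elt (|yes> + |no>) becomes a fair coin flip:
def random_filter(superposition):
    return [x for x in superposition if random.choice(["yes", "no"]) == "yes"]

print(random_filter(list("abcdefghijk")))
# eg ['c', 'f', 'g', 'i'] -- different on every run
```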
Now a breakfast menu example:
sa: load breakfast-menu.sw
-- "What can I get for breakfast that is under $6?"
sa: is-under-6-dollars |*> #=> is-less-than[6] price |_self>
sa: table[food,price,calories,description] sort-by[price] such-that[is-under-6-dollars] "" |menu: breakfast>
+-----------------+-------+----------+---------------------------------------------------------------------+
| food            | price | calories | description                                                         |
+-----------------+-------+----------+---------------------------------------------------------------------+
| French Toast    | 4.50  | 600      | "Thick slices made from our homemade sourdough bread"               |
| Belgian Waffles | 5.95  | 650      | "Two of our famous Belgian Waffles with plenty of real maple syrup" |
+-----------------+-------+----------+---------------------------------------------------------------------+

-- "What can I get for breakfast that is under 700 calories?"
sa: is-under-700-calories |*> #=> is-less-than[700] calories |_self>
sa: table[food,calories,price,description] sort-by[calories] such-that[is-under-700-calories] "" |menu: breakfast>
+-----------------+----------+-------+---------------------------------------------------------------------+
| food            | calories | price | description                                                         |
+-----------------+----------+-------+---------------------------------------------------------------------+
| French Toast    | 600      | 4.50  | "Thick slices made from our homemade sourdough bread"               |
| Belgian Waffles | 650      | 5.95  | "Two of our famous Belgian Waffles with plenty of real maple syrup" |
+-----------------+----------+-------+---------------------------------------------------------------------+
And I guess that is about as clear as I can make it. Note we do have to do a little dancing, as is common in the BKO scheme: we have to predefine our truth-statement operators before we can use them in such-that[]. I don't know of a good way to solve that, and maybe it is not an issue. We have to do likewise with tables quite often, and it is only one line of code to do so, and it is easy enough to load collections of rules from sw files. Indeed, with further thought, this actually encourages modularity, which is a good thing!

Also, I should mention there are still some things we can't do given the current parser. eg:
"I want something with bacon in it!"
In set builder would look like:
|answer> => |x> in |menu: breakfast> such that <food: bacon|read:description|x> > 0.9

Using such-that, it would probably look like:
contains-bacon |*> #=> |food: bacon> in read description |_self>
|answer> => such-that[contains-bacon] "" |menu: breakfast>

The bit currently not implemented is:
KET in SUPERPOSITION

Anyway, that's it for this post. Heaps more to come!

Update: we can actually do this, and we don't need any new parser code. We can use either the intersection() function, or the new mbr() function.
sa: contains-bacon |*> #=> do-you-know mbr(|word: bacon>,read description |_self>)
sa: such-that[contains-bacon] "" |menu: breakfast>
|food: Homestyle Breakfast>
Very cool!
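In plain Python, the mbr() trick amounts to a word-membership test on the description. A sketch with toy stand-in data (the real descriptions live in breakfast-menu.sw; the Homestyle Breakfast wording below is invented for illustration):

```python
# toy stand-in for the description operator in breakfast-menu.sw
# (the Homestyle Breakfast wording is invented for illustration):
description = {
    "food: Belgian Waffles": "Two of our famous Belgian Waffles with plenty of real maple syrup",
    "food: Homestyle Breakfast": "Two eggs with bacon or sausage, toast and hash browns",
}

# mbr(|word: bacon>, read description |_self>) becomes a word-membership test:
def contains_bacon(item):
    return "bacon" in description.get(item, "").lower().split()

matches = [x for x in description if contains_bacon(x)]
print(matches)  # ['food: Homestyle Breakfast']
```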

Sunday 22 February 2015

simple image recognition

Today, we use similar[op] for simple image recognition. Simple in that we don't need to do any normalizations for it to work. We don't need to translate, rotate, magnify, or otherwise align. And we restrict our pixel values to 0 or 1 (though that is not important, coeffs not in {0,1} should be fine too).

Here are our images:
|letter: H>
#   #
#   #
#   #
#####
#   #
#   #
#   #

|noisy: H>
    #
#   #
#   #
### #
#    
#   #
#   #

|noisy: H2>
#   #
#    
# ###
#####
##  #
#   #
### #

|letter: I>
#####
  #
  #
  #
  #
  #
#####

|noisy: I>
####
  #
  
  
  #
  #
# ###

|noisy: I2>
##  #
 ###
  #
  #
  ###
####
#####

|letter: O>
######
#    #
#    #
#    #
#    #
#    #
######
Now, load these into the console, and have a play.
-- load up the data:
sa: load H-I-pat-rec.sw

-- save some typing:
sa: |list> => relevant-kets[pixels] |>

-- do some processing: 
sa: simm |*> #=> 100 self-similar[pixels] |_self>
sa: drop-simm |*> #=> drop-below[60] 100 self-similar[pixels] |_self>
sa: tidy-simm |*> #=> set-to[100] drop-below[0.6] self-similar[pixels] |_self>
sa: map[simm,similarity] "" |list>
sa: map[drop-simm,drop-similarity] "" |list>
sa: map[tidy-simm,tidy-similarity] "" |list>

-- now check the results:
sa: matrix[similarity]
[ letter: H ] = [  100.00  29.41   40.91   82.35   76.19   17.65   35.00   ] [ letter: H ]
[ letter: I ]   [  29.41   100.00  45.45   26.67   38.10   73.33   65.00   ] [ letter: I ]
[ letter: O ]   [  40.91   45.45   100.00  36.36   50.00   36.36   40.91   ] [ letter: O ]
[ noisy: H  ]   [  82.35   26.67   36.36   100.00  61.90   14.29   25.00   ] [ noisy: H  ]
[ noisy: H2 ]   [  76.19   38.10   50.00   61.90   100.00  19.05   47.62   ] [ noisy: H2 ]
[ noisy: I  ]   [  17.65   73.33   36.36   14.29   19.05   100.00  45.00   ] [ noisy: I  ]
[ noisy: I2 ]   [  35.00   65.00   40.91   25.00   47.62   45.00   100.00  ] [ noisy: I2 ]

sa: matrix[drop-similarity]
[ letter: H ] = [  100.00  0       0       82.35   76.19   0       0       ] [ letter: H ]
[ letter: I ]   [  0       100.00  0       0       0       73.33   65.00   ] [ letter: I ]
[ letter: O ]   [  0       0       100.00  0       0       0       0       ] [ letter: O ]
[ noisy: H  ]   [  82.35   0       0       100.00  61.90   0       0       ] [ noisy: H  ]
[ noisy: H2 ]   [  76.19   0       0       61.90   100.00  0       0       ] [ noisy: H2 ]
[ noisy: I  ]   [  0       73.33   0       0       0       100.00  0       ] [ noisy: I  ]
[ noisy: I2 ]   [  0       65.00   0       0       0       0       100.00  ] [ noisy: I2 ]

sa: matrix[tidy-similarity]
[ letter: H ] = [  100.00  0       0       100.00  100.00  0       0       ] [ letter: H ]
[ letter: I ]   [  0       100.00  0       0       0       100.00  100.00  ] [ letter: I ]
[ letter: O ]   [  0       0       100.00  0       0       0       0       ] [ letter: O ]
[ noisy: H  ]   [  100.00  0       0       100.00  100.00  0       0       ] [ noisy: H  ]
[ noisy: H2 ]   [  100.00  0       0       100.00  100.00  0       0       ] [ noisy: H2 ]
[ noisy: I  ]   [  0       100.00  0       0       0       100.00  0       ] [ noisy: I  ]
[ noisy: I2 ]   [  0       100.00  0       0       0       0       100.00  ] [ noisy: I2 ]
So, a nice simple proof-of-concept example, and it successfully classifies the letters into the right groups: H with H, I with I, and O by itself.
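As a cross-check, the core of that computation is easy to reproduce in standalone Python: encode the two patterns as 0/1 pixel vectors and apply the rescaled simm. This is a sketch, not the project's similar[op] code, but it reproduces the 82.35 entry for |letter: H> vs |noisy: H>:

```python
def to_pixels(rows, width=5):
    # flatten the ASCII art into a 0/1 vector
    return [1 if ch == '#' else 0 for row in rows for ch in row.ljust(width)]

def simm(f, g):
    # rescaled simm: \Sum_k min(f[k]/wf, g[k]/wg) with wf = \Sum_k f[k], etc.
    wf, wg = sum(f), sum(g)
    if wf == 0 or wg == 0:
        return 0
    return sum(min(a / wf, b / wg) for a, b in zip(f, g))

letter_H = to_pixels(["#   #", "#   #", "#   #", "#####", "#   #", "#   #", "#   #"])
noisy_H  = to_pixels(["    #", "#   #", "#   #", "### #", "#    ", "#   #", "#   #"])

print(round(100 * simm(letter_H, noisy_H), 2))  # 82.35
```

The 82.35 is just 14/17: noisy H has 14 pixels, all of which sit inside letter H's 17.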

A more interesting example coming up!

Update: I guess it would be instructive to see what we have at the superposition level:
similarity |letter: H> => 100.000|letter: H> + 82.353|noisy: H> + 76.190|noisy: H2> + 40.909|letter: O> + 35.000|noisy: I2> + 29.412|letter: I> + 17.647|noisy: I>
drop-similarity |letter: H> => 100.000|letter: H> + 82.353|noisy: H> + 76.190|noisy: H2>
tidy-similarity |letter: H> => 100.000|letter: H> + 100.000|noisy: H> + 100.000|noisy: H2>

similarity |noisy: H2> => 100.000|noisy: H2> + 76.190|letter: H> + 61.905|noisy: H> + 50.000|letter: O> + 47.619|noisy: I2> + 38.095|letter: I> + 19.048|noisy: I>
drop-similarity |noisy: H2> => 100.000|noisy: H2> + 76.190|letter: H> + 61.905|noisy: H>
tidy-similarity |noisy: H2> => 100.000|noisy: H2> + 100.000|letter: H> + 100.000|noisy: H>

similarity |letter: I> => 100.000|letter: I> + 73.333|noisy: I> + 65.000|noisy: I2> + 45.455|letter: O> + 38.095|noisy: H2> + 29.412|letter: H> + 26.667|noisy: H>
drop-similarity |letter: I> => 100.000|letter: I> + 73.333|noisy: I> + 65.000|noisy: I2>
tidy-similarity |letter: I> => 100.000|letter: I> + 100.000|noisy: I> + 100.000|noisy: I2>

similarity |letter: O> => 100.000|letter: O> + 50.000|noisy: H2> + 45.455|letter: I> + 40.909|letter: H> + 40.909|noisy: I2> + 36.364|noisy: H> + 36.364|noisy: I>
drop-similarity |letter: O> => 100.000|letter: O>
tidy-similarity |letter: O> => 100.000|letter: O>

Friday 20 February 2015

some simple similar[op] examples

In this post, just a few simple examples of similar[op] in action. More interesting examples in future posts.

First, say we know a little about oranges. We know their taste, their smell, their colour, and their size. And say we know those operators for other objects too. Then, in an English like way, we can ask:
similar[taste] |orange>
similar[smell] |orange>
similar[colour] |orange>
similar[size] |orange>
and we should get back expected results. Now, how about a working example?
-- learn some toy knowledge about disease symptoms:
disease-symptoms |disease: 1> => |symptom: a> + |symptom: b> + |symptom: c>
disease-symptoms |disease: 2> => |symptom: d> + |symptom: e>
disease-symptoms |disease: 3> => |symptom: f> + |symptom: g> + |symptom: h> + |symptom: i>
disease-symptoms |disease: 4> => |symptom: j> + |symptom: k> + |symptom: l>

-- learn some symptoms of a new patient:
disease-symptoms |patient: 1> => |symptom: b> + |symptom: h> + |symptom: f> + |symptom: k>

-- now ask which disease is most likely for our patient
-- (the 100 is in there to convert the result to a percentage)
sa: 100 similar[disease-symptoms] |patient: 1>
50.000|disease: 3> + 25.000|disease: 1> + 25.000|disease: 4>
So 50% chance of disease 3, and 25% for disease 1 and 4.
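The same numbers drop out of a quick standalone sketch, treating each disease-symptoms superposition as a set of symptoms and applying the set version of the rescaled simm:

```python
# toy knowledge from above:
disease_symptoms = {
    "disease: 1": {"a", "b", "c"},
    "disease: 2": {"d", "e"},
    "disease: 3": {"f", "g", "h", "i"},
    "disease: 4": {"j", "k", "l"},
}
patient = {"b", "h", "f", "k"}

def simm(f, g):
    # set version of rescaled simm: |f intersect g| / max(|f|, |g|)
    if not f or not g:
        return 0
    return len(f & g) / max(len(f), len(g))

# 100 similar[disease-symptoms] applied against every disease, dropping zeros:
result = {d: 100 * simm(patient, s)
          for d, s in disease_symptoms.items() if simm(patient, s) > 0}
print(result)  # {'disease: 1': 25.0, 'disease: 3': 50.0, 'disease: 4': 25.0}
```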

Now, another example. Say a shopping centre keeps track of all your purchases (heh, not a big stretch!). Now, given a new shopping basket just rung up, can we guess who it might be? Yup.
-- this is what the shop knows:
basket |user 1> => |milk> + |bread> + |tea> + |bananas> + |carrots> + |chocolate>
basket |user 2> => 4|apple> + |milk> + |coffee> + |steak>
basket |user 3> => |chocolate> + |vegemite> + |olive oil> + |pizza> + |cheese>
basket |user 4> => |vegemite> + |cheese> + |bread> + |salami>

-- a new basket has just been processed:
basket |f> => 3|apple> + 5|oranges> + |milk> + |bread> + |coffee> + |steak>

-- now ask, which user is this likely to be?
sa: 100 similar[basket] |f>
50.000|user 2> + 16.667|user 1> + 8.333|user 4>
So, 50% chance it is user 2. BTW, this example relies on the assumption that from shopping trip to shopping trip you largely purchase the same thing. Or at least that your average shopping basket converges to something that can identify you. And given that a large supermarket may have on the order of 40,000 different products, it is quite likely your shopping basket can identify you with high confidence!

Anyway, the above is really just proof-of-concept for similar[op] |x>. In a future post I will give a bigger example.

Update: I guess the point is that it is all abstract and general, so we can apply the exact same technique all over the place. eg, perhaps identifying you from your browser details similar to EFF's panopticlick. All you need are appropriately defined operators mapping objects to superpositions.

introducing: similar[op] |x>

We finally have enough pieces in place to describe the similar function operator. This beasty is quite useful! Arrogantly enough, I'm calling this thing "pattern recognition". In this post, just the code. Examples will follow in future posts.

Python:
-- in the new_context() class:
# just simm applied to relevant kets
def pattern_recognition(self,pattern,op,t=0):
  if type(op) == ket:
    op = op.label[4:]
  result = superposition()
  for label in self.ket_rules_dict:
    if op in self.ket_rules_dict[label]:
      candidate_pattern = self.recall(op,label,True)
      value = silent_simm(pattern,candidate_pattern)
      if value > t:                             
        result.data.append(ket(label,value))  # "result += ket(label,value)" when we swap in fast_superposition
  return result.coeff_sort()

-- in the ket() class:
# implements: similar[op] |x>
def similar(self,context,op):              
  f = self.apply_op(context,op)            
  return context.pattern_recognition(f,op).delete_ket(self)   
Now, let's explain in English what is going on.
Consider: similar[example-operator] |x>

First, find the superposition resulting from "example-operator" applied to |x>.
  f = self.apply_op(context,op)            
Then find all the kets that support the "example-operator" operator.
  for label in self.ket_rules_dict:
    if op in self.ket_rules_dict[label]:
Then find the superpositions resulting from "example-operator" applied to each of them.
      candidate_pattern = self.recall(op,label,True)
Then find how similar they are to f/pattern:
      value = silent_simm(pattern,candidate_pattern)
Keep only the interesting ones:
      if value > t:                             
        result.data.append(ket(label,value))
Sort and return the result:
  return result.coeff_sort()
Delete |x> since we are always 100% similar with ourselves:
  return context.pattern_recognition(f,op).delete_ket(self)   
Update: Tweaked the code so we have a version of similar[op] that doesn't delete |x>
-- in the ket() class:
# implements: self-similar[op] |x>
def self_similar(self,context,op):
  f = self.apply_op(context,op)
  return context.pattern_recognition(f,op)
Update: tweaked similar so that if we want, the operator we apply to |x> is different to the operator we use in looking for matching patterns.

The new python:
-- in the ket() class:
def similar(self,context,ops):
    try:
      op1,op2 = ops.split(',')
    except ValueError:
      op1 = ops
      op2 = ops
    f = self.apply_op(context,op1)
    return context.pattern_recognition(f,op2).delete_ket(self)

def self_similar(self,context,ops):
    try:
      op1,op2 = ops.split(',')
    except ValueError:
      op1 = ops
      op2 = ops
    f = self.apply_op(context,op1)
    return context.pattern_recognition(f,op2)
Update: It should be rather trivial to write a parallel version of context.pattern_recognition() in the case of large sw files. Roughly, partition the sw file. Calculate the resulting superposition for each. Add them up, do a coeff sort. Done.
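That partition/merge logic might be sketched like so (sequential here, with hypothetical names; handing the chunk loop to multiprocessing.Pool.map would be the parallel step):

```python
def simm(f, g):
    # rescaled simm over dicts mapping ket label -> coeff
    wf, wg = sum(f.values()), sum(g.values())
    if wf == 0 or wg == 0:
        return 0
    return sum(min(f[k] / wf, g[k] / wg) for k in f.keys() & g.keys())

def chunk_similar(pattern, chunk, t=0):
    # score every stored pattern in one partition
    result = {}
    for label, stored in chunk:
        value = simm(pattern, stored)
        if value > t:
            result[label] = value
    return result

def pattern_recognition(pattern, stored_patterns, n_chunks=4):
    items = list(stored_patterns.items())
    size = max(1, (len(items) + n_chunks - 1) // n_chunks)
    chunks = [items[i:i + size] for i in range(0, len(items), size)]
    merged = {}
    for chunk in chunks:          # this loop is what Pool.map would parallelise
        merged.update(chunk_similar(pattern, chunk))
    # the coeff sort: highest similarity first
    return sorted(merged.items(), key=lambda kv: -kv[1])

stored = {"x": {"a": 1, "b": 1}, "y": {"a": 1}}
print(pattern_recognition({"a": 1, "b": 1}, stored, n_chunks=2))
# [('x', 1.0), ('y', 0.5)]
```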

the landscape function

Now we have described our simm adequately, we can start to put it to use. In this post, a brief one, the landscape function. I guess briefly, the landscape function maps an incoming pattern f into a mathematical surface, though almost always that surface is not continuous/smooth.

Say we have a different pattern g(x) at each point x, then simply enough:
L(f,x) = simm(f,g(x))

So, what use is this? To be honest I haven't used it much, but perhaps something like this. Say you have mapped an image of your lounge-room into superpositions of features at point x in the room. ie, we have a g(x). This of course is currently hard!

At x0 we might have g(x0) = |object: light switch>
At x1 we might have g(x1) = |object: books> + |object: book-shelf>
At x2 we might have g(x2) = |object: TV>
and so on.

Then consider say f = |object: TV>, then:
x such that L(f,x) > 0 gives the set of points in the image that contain a TV.

If f = |object: light switch>, then:
x such that L(f,x) > 0 gives the set of points in the image that contain a light switch.
and so on.
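A tiny sketch of the landscape function over this lounge-room example, with g as a dict from points to feature sets and the set version of simm:

```python
def simm(f, g):
    # set version of simm: |f intersect g| / max(|f|, |g|)
    if not f or not g:
        return 0
    return len(f & g) / max(len(f), len(g))

# g(x): features at each point of the image (toy data from above)
g = {
    "x0": {"object: light switch"},
    "x1": {"object: books", "object: book-shelf"},
    "x2": {"object: TV"},
}

def landscape(f):
    # L(f,x) = simm(f, g(x))
    return {x: simm(f, gx) for x, gx in g.items()}

tv_points = [x for x, v in landscape({"object: TV"}).items() if v > 0]
print(tv_points)  # ['x2']
```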

Another perhaps better example is consider DNA microarrays. Here is a picture borrowed from wikipedia:
So in this case, g(x) can be considered to be the DNA pattern at location x.
And f is the target sample of DNA. So, a nice physical implementation of the landscape function. And I guess we are indirectly implying the strength of bonding between DNA strands has a strong similarity with the simm. Something we can realise if we find a good mapping between a DNA sequence and a superposition. Hrmm... maybe we need sequences instead! Anyway, DNA vs BKO is something that calls for more thought.

I guess that is about it for this post.

Thursday 19 February 2015

a superposition implementation of simm

Previously I gave a list implementation for simm.
Recall:
simm(w,f,g) = (w*f + w*g - w*[f - g])/2.max(w*f,w*g)
simm(w,f,g) = \Sum_k w[k] min(f[k],g[k]) / max(w*f,w*g)
In this post I'm going to give a couple of superposition implementations of this equation. And we should note in the BKO scheme most superpositions have coeffs >= 0, so we can use the min version of simm given above. So, first we need to observe some correspondences:
\Sum_k x[k] <=> x.count_sum()
min[f[k],g[k]] <=> intersection(f,g)
rescale so w*f == w*g == 1 <=> f.normalize() and g.normalize()
And so, some python:
# unscaled simm:
def unscaled_simm(A,B):
  wf = A.count_sum()
  wg = B.count_sum()
  if wf == 0 and wg == 0:
    return 0
  return intersection(A,B).count_sum()/max(wf,wg)

# weighted scaled simm:
def weighted_simm(w,A,B):
  A = multiply(w,A)
  B = multiply(w,B)
  return intersection(A.normalize(),B.normalize()).count_sum()

# standard use case simm:
def simm(A,B):
  if A.count() <= 1 and B.count() <= 1:
    a = A.ket()
    b = B.ket()
    if a.label != b.label:
      return 0
    a = max(a.value,0)                        # just making sure they are >= 0.
    b = max(b.value,0)
    if a == 0 and b == 0:                     # prevent div by zero.
      return 0
    return min(a,b)/max(a,b)
  return intersection(A.normalize(),B.normalize()).count_sum() 
In practice, I use the last one most of the time. And I should note it is only the last one that takes into account that you should not rescale when both superpositions have length 1. I didn't bother with the other two versions, since I don't actually use them much at all.

So why is this distinction important? Well, some examples:
If you rescale length 1 you get:
simm(|a>,2|a>) = 1
when you really want:
simm(|a>,2|a>) = 0.5
But this is only an issue if both superpositions are length 1. eg, this case has no issues:
simm(|a>,|a> + |b>) = 0.5
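A quick numeric illustration of that length-1 special case, with dicts standing in for superpositions (a sketch, not the project code):

```python
def unscaled_simm(f, g):
    wf, wg = sum(f.values()), sum(g.values())
    if wf == 0 or wg == 0:
        return 0
    return sum(min(f[k], g[k]) for k in f.keys() & g.keys()) / max(wf, wg)

def rescaled_simm(f, g):
    wf, wg = sum(f.values()), sum(g.values())
    if wf == 0 or wg == 0:
        return 0
    return sum(min(f[k] / wf, g[k] / wg) for k in f.keys() & g.keys())

# both superpositions length 1: rescaling washes out the coefficient difference
print(rescaled_simm({"a": 1}, {"a": 2}))   # 1.0  (not what we want)
print(unscaled_simm({"a": 1}, {"a": 2}))   # 0.5  (what we want)

# with more than one ket, rescaling is fine:
print(rescaled_simm({"a": 1}, {"a": 1, "b": 1}))  # 0.5
```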

I guess that is it for this post! Heaps more to come!

Wednesday 18 February 2015

some examples of list simm in action

So, we have given the code for 3 types of simm in the last post. Now, how about a nice visualisation of them in action. First, some python:
# define square shape:
def square_fn(centre,width,height,x):
  if (centre - width/2) <= x <= (centre + width/2):
    return height
  return 0

# define bell curve:
def bell_fn(centre,width,height,x):
  return height*math.exp(- (x - centre)**2/2*width)

# define Euclidean Distance function:
def ED(f,g):
  if len(f) != len(g):
    print("different length vectors!")
    return 0
  return math.sqrt(sum((f[k] - g[k])**2 for k in range(len(f))))

# define Gaussian simm:
# gaussian-simm(s,f,g) = exp(-||f - g||^2/2s)
def gauss_simm(s,f,g):
  return math.exp(-ED(f,g)**2/2*s)

# print out the results:
f = [square_fn(15,20,10,k) for k in range(100)]
g = [bell_fn(15,0.01,10,k) for k in range(100)]
print("Gauss simm:",round(gauss_simm(0.001,f,g),4))
print("simm:",round(list_simm([1],f,g),4))
print("scaled simm:",round(rescaled_list_simm([1],f,g),4))
Now, some examples:
f = [square_fn(15,20,10,k) for k in range(100)]
g = [bell_fn(15,0.01,10,k) for k in range(100)]
Gauss simm: 0.8583
simm: 0.752
scaled simm: 0.752

f = [square_fn(15,20,10,k) for k in range(100)]
g = [bell_fn(35,0.01,10,k) for k in range(100)]
Gauss simm: 0.2208
simm: 0.1698
scaled simm: 0.1698

f = [square_fn(15,20,10,k) for k in range(100)]
g = [bell_fn(75,0.01,10,k) for k in range(100)]
Gauss simm: 0.1443
simm: 0.0
scaled simm: 0.0

Now some notes:
1) I used s = 0.001 in the Gaussian simm. So you may want to choose other values for that.
2) I rounded the results to 4 digits
3) In this case simm and scaled simm give essentially the same answer. This is because w*f is pretty close to w*g even without rescaling:
wf: 210
wg: 248.9

OK. Let's redo the examples with a smaller bell curve for curve g (we just drop the height from 10 to 1):
f = [square_fn(15,20,10,k) for k in range(100)]
g = [bell_fn(15,0.01,1,k) for k in range(100)]
Gauss simm: 0.4141
simm: 0.0843
scaled simm: 0.752

f = [square_fn(15,20,10,k) for k in range(100)]
g = [bell_fn(35,0.01,1,k) for k in range(100)]
Gauss simm: 0.3619
simm: 0.0203
scaled simm: 0.1698

f = [square_fn(15,20,10,k) for k in range(100)]
g = [bell_fn(75,0.01,1,k) for k in range(100)]
Gauss simm: 0.3469
simm: 0.0
scaled simm: 0.0

And now the difference between simm and scaled simm is stark. Simm is now much smaller, but scaled simm gives the exact same answer as in the previous examples (as we expect!).

And that's it for this post. I guess a superposition implementation of simm is up next!

Tuesday 17 February 2015

a list implementation of the simm

Briefly, here is the code for a list implementation of simm (as opposed to the superposition version). Note it is unscaled, so we are not rescaling f and g so that w*f == w*g. I may give the code for that later (I don't think I have it typed up anywhere just yet).

Recall:
simm(w,f,g) = (w*f + w*g - w*[f - g])/2.max(w*f,w*g)
Python:
def list_simm(w,f,g):
  the_len = min(len(f),len(g))
#  w += [0] * (the_len - len(w))
  w += [1] * (the_len - len(w))
  f = f[:the_len]
  g = g[:the_len]

  wf = sum(abs(w[k]*f[k]) for k in range(the_len))
  wg = sum(abs(w[k]*g[k]) for k in range(the_len))
  wfg = sum(abs(w[k]*f[k] - w[k]*g[k]) for k in range(the_len))
  
  if wf == 0 and wg == 0:
    return 0
  else:
    return (wf + wg - wfg)/(2*max(wf,wg))
And that's it! Heaps more to come!

Update: OK. I wrote a rescaled version of list simm. I haven't tested it, but I think it is probably right :)
Python:
def rescaled_list_simm(w,f,g):
  the_len = min(len(f),len(g))
# normalize lengths of our lists:
#  w += [0] * (the_len - len(w))
  w += [1] * (the_len - len(w))
  f = f[:the_len]
  g = g[:the_len]

# rescale step, first find size:
  s1 = sum(abs(w[k]*f[k]) for k in range(the_len))
  s2 = sum(abs(w[k]*g[k]) for k in range(the_len))

# if s1 == 0, or s2 == 0, we can't rescale:
  if s1 == 0 or s2 == 0:
    return 0
  
# now rescale:
# we just need w*f == w*g, the exact value doesn't matter, so we choose 1.
# noting that our equation has symmetry under: "f => k.f, g => k.g"
# also, note that finite precision floats means sometimes it does matter, but hopefully we will be fine.
  f = [f[k]/s1 for k in range(the_len)]  
  g = [g[k]/s2 for k in range(the_len)]  
  
# proceed with algo:
# if we did the rescale step correctly we will have:
# wf == wg == 1
#  wf = sum(abs(w[k]*f[k]) for k in range(the_len))
#  wg = sum(abs(w[k]*g[k]) for k in range(the_len))
  wfg = sum(abs(w[k]*f[k] - w[k]*g[k]) for k in range(the_len))
  
# we should never have wf or wg == 0 in the rescaled case:
#  if wf == 0 and wg == 0:
#    return 0
#  else:
#    return (wf + wg - wfg)/(2*max(wf,wg))
  return (2 - wfg)/2 
Update: OK. May as well do an implementation of Gaussian simm too:
gaussian-simm(s,f,g) = exp(-||f - g||^2/2s)
Python:
import math

# define Euclidean Distance function:
def ED(f,g):
  if len(f) != len(g):
    print("different length vectors!")
    return 0
  return math.sqrt(sum((f[k] - g[k])**2 for k in range(len(f))))

# define Gaussian simm:
# gaussian-simm(s,f,g) = exp(-||f - g||^2/2s)
def gauss_simm(s,f,g):
  return math.exp(-ED(f,g)**2/2*s)

a similarity metric

In this post I will try to explain the derivation of my similarity metric. This thing is useful in a bunch of places.

First, note that in my travels I have seen this given as a similarity metric:
similarity-metric(f,g,s) = exp(-||f - g||^2/2s)
So, if f == g, it returns 1, if ||f - g||^2 is big, it approaches 0.

Now, for the derivation of my metric.
First, define a multiplication operator that will simplify the equations a little:
a*b = \Sum_k abs(a[k] . b[k]) -- discrete version
a*b = \Int dt abs(a(t) . b(t)) -- continuous version
where . is the standard multiplication operator.
Both of these have the property:
0 <= w*[f - g] <= w*f + w*g
where it equals 0 if f == g, and w*f + w*g if f and g are perfectly disjoint,
which is really just another way of stating a couple of the standard metric axioms:
2. d(x,y) = 0 if and only if x = y         (identity of indiscernibles)
4. d(x,z) <= d(x,y) + d(y,z)               (triangle inequality)
And it is illustrative to note that:
w*[f - g] = \Sum_k |w[k].f[k] - w[k].g[k]|  -- discrete version
w*[f - g] = \Int dt |w(t).f(t) - w(t).g(t)| -- continuous version
Now, just divide through by the right hand side, casting it into the range [0,1]:
0 <= w*[f - g]/(w*f + w*g) <= 1
Now, this is a good start, but for some examples it doesn't work as well as expected (maybe one day I will dig up an example to show my point!).
But we can fix this by adding a term R.abs(w*f - w*g), where R >= 0; it works best with R = 1.
And hence simm:
0 <= (w*[f - g] + R.abs(w*f - w*g))/(w*f + w*g + R.abs(w*f - w*g)) <= 1
So, OK. Let's set R = 1; then, noting that a + b + abs(a - b) = 2.max(a,b), we can compact it down to:
0 <= (w*[f - g] + abs(w*f - w*g))/2.max(w*f,w*g) <= 1
Cool. And now observe, as in the physics tradition, this equation has some symmetries:
given the scalar k (not equal to 0), we have:
1) symmetry under w => k.w                       
2) symmetry under f => k.f and g => k.g
3) symmetry under w => w.exp(i*t)/R(t), f => R(t).exp(-i*t).f, g => R(t).exp(-i*t).g   
Now, I can't think of a use case for the third symmetry there, but I put it in because it reminds me of gauge transformations in physics, and is a good decider on what terms are valid to use in our metric. eg: w*f, w*g, w*[f - g], abs(w*f - w*g).

We are about done. The final thing to note is that the above metric returns 0 for an exact match, and 1 for perfectly disjoint. If we want to swap this, just map the metric m to 1 - m (presuming m "attains" its upper and lower bounds)
ie:
(w*[f - g] + R.abs(w*f - w*g))/(w*f + w*g + R.abs(w*f - w*g))
becomes:
1 - (w*[f - g] + R.abs(w*f - w*g))/(w*f + w*g + R.abs(w*f - w*g))
  = (w*f + w*g + R.abs(w*f - w*g) - w*[f - g] - R.abs(w*f - w*g))/(w*f + w*g + R.abs(w*f - w*g))
  = (w*f + w*g - w*[f - g])/(w*f + w*g + R.abs(w*f - w*g)) 
  = (w*f + w*g - w*[f - g])/2.max(w*f,w*g)                -- assuming R = 1
So there we have it:
simm(w,f,g) = (w*f + w*g - w*[f - g])/2.max(w*f,w*g)
NB: if w is not given, ie we are using simm(f,g), then assume w[k] = 1 for all k, or w(t) = 1 for all t.   
Now a couple of observations:
1) the above derivation of my simm was based on the Manhattan metric. A similar derivation could be done using the standard Euclidean metric. Or presumably any metric. Though I think Manhattan is better than Euclidean because it is in some sense linear, and doesn't require any sqrt()s.
2) the fact that the denominator collapses down only when R = 1 is a hint that that is the right choice!
3) from experience playing with this, simm(w,f,g) usually, though not always, gives best results when f and g are rescaled so that w*f == w*g (excluding the case where they are length 1 vectors, in which case you should not rescale!)
4) if f[k] and g[k] are >= 0 for all terms, then the equation can be compressed using a + b - abs(a - b) = 2.min(a,b), and becomes:
simm(w,f,g) = \Sum_k w[k] min(f[k],g[k]) / max(w*f,w*g)

And that's it! An implementation in python, and some examples coming up next.

Update: A quick expansion on observation (1). So, using the Manhattan metric as the base, we have:
w*[f - g] = \Sum_k |w[k].f[k] - w[k].g[k]|
If we used the Euclidean metric instead, we would have:
w*[f - g] = sqrt(\Sum_k |w[k].f[k] - w[k].g[k]|^2)
and hence my comment "in some sense linear".

And an expansion on observation (4):
Consider just the numerator:
w*f + w*g - w*[f - g]
Now, expand that out using our definition of a*b:
w*f + w*g - w*[f - g]
= \Sum_k |w[k].f[k]| + \Sum_k |w[k].g[k]| - \Sum_k |w[k].f[k] - w[k].g[k]|
= \Sum_k |w[k].f[k]| + |w[k].g[k]| - |w[k].f[k] - w[k].g[k]|
= \Sum_k |w[k]| (|f[k]| + |g[k]| - |f[k] - g[k]|)
Now, make use of "f[k] and g[k] are >= 0 for all terms":
= \Sum_k |w[k]| (f[k] + g[k] - |f[k] - g[k]|)
Now, make use of a + b - abs(a - b) = 2.min(a,b):
= \Sum_k |w[k]| 2.min(f[k],g[k])
And hence (assuming w[k] >= 0 for all k):
simm(w,f,g) = \Sum_k w[k] min(f[k],g[k]) / max(w*f,w*g)
as promised! And we take this pretty compression as a strong hint that Manhattan was a better choice than Euclidean! BTW, also note that the factor of 2 in front of min, and the 2 in 2.max, cancelled out.
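We can also spot-check all of this numerically: for non-negative w, f and g the two forms of simm agree, and the first two symmetries hold. A quick sketch:

```python
import random

def simm_full(w, f, g):
    # simm(w,f,g) = (w*f + w*g - w*[f - g]) / 2.max(w*f, w*g)
    wf  = sum(abs(w[k] * f[k]) for k in range(len(f)))
    wg  = sum(abs(w[k] * g[k]) for k in range(len(g)))
    wfg = sum(abs(w[k] * f[k] - w[k] * g[k]) for k in range(len(f)))
    if wf == 0 and wg == 0:
        return 0
    return (wf + wg - wfg) / (2 * max(wf, wg))

def simm_min(w, f, g):
    # simm(w,f,g) = \Sum_k w[k] min(f[k],g[k]) / max(w*f, w*g)
    wf = sum(w[k] * f[k] for k in range(len(f)))
    wg = sum(w[k] * g[k] for k in range(len(g)))
    if wf == 0 and wg == 0:
        return 0
    return sum(w[k] * min(f[k], g[k]) for k in range(len(f))) / max(wf, wg)

random.seed(1)
for _ in range(100):
    w = [random.random() for _ in range(5)]
    f = [random.random() for _ in range(5)]
    g = [random.random() for _ in range(5)]
    # the compression: both forms agree for non-negative w, f, g
    assert abs(simm_full(w, f, g) - simm_min(w, f, g)) < 1e-9
    # symmetries 1 and 2: invariance under w => k.w, and under f => k.f, g => k.g
    k = 3.7
    assert abs(simm_full([k * x for x in w], f, g) - simm_full(w, f, g)) < 1e-9
    assert abs(simm_full(w, [k * x for x in f], [k * x for x in g]) - simm_full(w, f, g)) < 1e-9
```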

Update: OK. I don't know a use case for these, but we can generalize to more than pairs {f,g}.
simm(w,f1,f2) = \Sum_k w[k] min(f1[k],f2[k])/max(w*f1,w*f2)
simm(w,f1,f2,f3) = \Sum_k w[k] min(f1[k],f2[k],f3[k])/max(w*f1,w*f2,w*f3)
simm(w,f1,f2,f3,f4) = \Sum_k w[k] min(f1[k],f2[k],f3[k],f4[k])/max(w*f1,w*f2,w*f3,w*f4)
simm(w,f1,f2,f3,f4,f5) = \Sum_k w[k] min(f1[k],f2[k],f3[k],f4[k],f5[k])/max(w*f1,w*f2,w*f3,w*f4,w*f5)
and so on.

Sunday 15 February 2015

new feature: memoizing rules

So, I was going to write up about "similar" in this phase of the project, but today I implemented a new feature I want to mention: "memoizing rules". Rules that save the results of their computation. I guess the motivation was quantum mechanics. This whole project I have been borrowing/co-opting ideas from QM: bras, kets, operators, superpositions, exponentiating operators, and so on.

Indeed, stored rules were based on the QM idea that an object doesn't have a property until you measure it. In BKO, consider an example like this (making use of a stored rule):
is-alive |cat> #=> normalize pick-elt (0.5|yes> + 0.5|no>)

OK. So that is the start of a representation of Schroedinger's poor cat, but what about wave-fn collapse? The idea that before measurement it is in a superposition, but after measurement it is in a single state. Stored rules almost represent this, but not quite. And hence memoizing rules (NB: the "!=>"):
is-alive |cat> !=> normalize pick-elt (0.5|yes> + 0.5|no>)

With this new notation, "is-alive |cat>" is undefined until we measure it, and once measured it collapses into a single state. Hopefully an example in the console will help explain:
-- load up the rules, and see what we have:
sa: dump
----------------------------------------
|context> => |context: memoizing Schrodingers cat>

is-alive |cat> #=> normalize pick-elt (0.5|yes> + 0.5|no>)
memoizing-is-alive |cat> !=> normalize pick-elt (0.5|yes> + 0.5|no>)
----------------------------------------
Now, ask in the console the status of the cat:
sa: is-alive |cat>
|yes>

sa: memoizing-is-alive |cat>
|no>
Now see what we have:
sa: dump
----------------------------------------
|context> => |context: memoizing Schrodingers cat>

is-alive |cat> #=> normalize pick-elt (0.5|yes> + 0.5|no>)
memoizing-is-alive |cat> => |no>
----------------------------------------
So there we have it! Undefined value until measurement, and collapse to a single value after measurement.
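The memoizing-rule behaviour is easy to model in plain Python (a toy sketch of the idea, not the actual project code): store the rule unevaluated, and on first measurement replace it with its result.

```python
import random

rules = {}  # toy context: maps (op, ket) to a value or an unevaluated rule

def learn_memoizing(op, ket, thunk):
    rules[(op, ket)] = thunk          # "!=>" stores the rule unevaluated

def apply_op(op, ket):
    rule = rules[(op, ket)]
    if callable(rule):                # first measurement: evaluate ...
        rules[(op, ket)] = rule()     # ... and collapse to a single value
    return rules[(op, ket)]

learn_memoizing('is-alive', 'cat', lambda: random.choice(['yes', 'no']))
first = apply_op('is-alive', 'cat')
# every later measurement returns the same collapsed value
assert all(apply_op('is-alive', 'cat') == first for _ in range(10))
```

Which random answer we get is undefined until the first call, but after that it is fixed, just like the memoizing-is-alive rule in the console transcript.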

OK. So that is cool and all, but what is the use of this? Well, one use is to drastically speed up recursive algorithms like Fibonacci. Here, let me show:
sa: dump
----------------------------------------
|context> => |context: memoizing Fibonacci>

fib |0> => |0>
fib |1> => |1>
n-1 |*> #=> arithmetic(|_self>,|->,|1>)
n-2 |*> #=> arithmetic(|_self>,|->,|2>)
fib |*> !=> arithmetic(fib n-1 |_self>,|+>,fib n-2 |_self>)
----------------------------------------

-- ask value of fib |13>
sa: fib |13>
|233>
 
-- see what we now know:
sa: dump
----------------------------------------
|context> => |context: memoizing Fibonacci>

fib |0> => |0>
fib |1> => |1>

n-1 |*> #=> arithmetic(|_self>,|->,|1>)
n-2 |*> #=> arithmetic(|_self>,|->,|2>)
fib |*> !=> arithmetic(fib n-1 |_self>,|+>,fib n-2 |_self>)

fib |2> => |1>
fib |3> => |2>
fib |4> => |3>
fib |5> => |5>
fib |6> => |8>
fib |7> => |13>
fib |8> => |21>
fib |9> => |34>
fib |10> => |55>
fib |11> => |89>
fib |12> => |144>
fib |13> => |233>
----------------------------------------
Cool, huh! And it is a massive speed-up! Naive recursive Fibonacci is exponential time (the number of calls grows roughly like 1.618^n), while the memoized version is essentially O(n). And powerfully enough, we don't really have to do much work to use this auto-memoizing feature. Instead of using a stored-rule "#=>", just use the exact same code with a memoizing-rule "!=>".
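For comparison, the same auto-memoization trick sketched in plain Python (my sketch of the idea, not the project's implementation):

```python
def memoize(f):
    cache = {}
    def wrapper(n):
        if n not in cache:        # compute each value at most once
            cache[n] = f(n)
        return cache[n]
    return wrapper

@memoize
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(13))  # 233
```

Without the decorator, fib(30) already takes over a million calls; with it, each fib(n) is computed exactly once.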

Update: I just ran fib-ratio |100>, and it was smoking fast! Almost instant. Previously, fib-ratio |30> was sloooow. Code here.

Friday 13 February 2015

Announcing phase 3: similar and find-topic operators

Phase 1 was literal operators, phase 2 was function operators, and now phase 3 we concentrate on a couple of special function operators:
similar[op] |x>
find-topic[op] |x>

I'll start with similar in the next post. Heaps more to come!

Thursday 12 February 2015

This Concludes Phase 2

So, this post concludes phase 2 of my write-up. Phase 1 was largely about (trying) to explain the basics, concentrating on literal operators (recall, defined using OP KET => SUPERPOSITION). Phase 2 was about describing and giving some example usages of function operators. Along the way, I hope, I also demonstrated some of the power of the whole BKO scheme.

The "BKO scheme" is the idea that you can get a long way with just kets, superpositions, and operators that act on kets and/or superpositions and return kets or superpositions. (Maybe eventually I will include sequences in this scheme. The idea is that superpositions are a space-based collection of kets, and sequences are a time-based collection of kets. Another way to look at it: in a superposition the order of the kets is irrelevant, ie, kets commute, while in a sequence the order of the kets is important.)

One consequence of this unified model is that you no longer need a new class for each type of object you are working with (the standard Object-Oriented model). Instead we use category/data-type labels to represent that. So instead of needing to define, say, a person class, we use:
op |person: *> #=> ...
for an operator that works for all members of type person. And:
op |plant: tree: *> #=> ...
for an operator that works for all types of trees. And:
op |plant: tree: elm> #=> ...
for an operator that works just for trees of type elm. And:
op |*> #=> ...
for an operator that works for all types (though usually this just means we have been too lazy to specify the data-type!).

Indeed, if we look at the code, we have, and need, only a handful of classes: one each for kets, superpositions, stored-rules, context, and context-list. That's it!

Talking of code, at some stage I will link to it. The long-term plan is for this to be open source, and for others to bang on it. Of course, currently the code quality is terrible (but mostly works); when I started I was a programming newb, and did terrible things! In the shorter term, I would love to get others on board to map more data to the sw format, and on the other side of that, to get more people (ie, someone other than me!) to use the sw format.

So world, welcome to the Feynman Knowledge Engine: a minimalist, but working, knowledge engine, with, in the back end, a custom data representation format (the sw format) and a language to work with it. Feynman being one of my physics heroes!

I have some interesting stuff coming up in phase 3, so as always, heaps more to come!

Update: OK. So I'm not going to link to the code just yet, but I will link to the console history file, which records all work done at the console. This is useful for looking back to see how you did something, and also so others can see how I make use of the BKO scheme.

Update: So, I've been re-reading the Sci Am semantic web paper. In particular, this interested me:
"The real power of the Semantic Web will be realized when people create many programs that collect Web content from diverse sources, process the information and exchange the results with other programs.
...
A typical process will involve the creation of a "value chain" in which subassemblies of information are passed from one agent to another, each one "adding value," to construct the final product requested by the end user."

Strikes me we could do that with not too much work by using the sw format as the method to transfer information from agent to agent. Also strikes me that this will go some way to making the web/internet into a giant brain. Though that is still some way off!

Heh, I have a mental picture of the BKO scheme operators mapping incoming superpositions to out-going superpositions. And agents a level higher than that, mapping incoming sw files to out-going sw files. Interesting!

I guess the next question is what language will these agents be written in? It is kinda clunky to use my back-end python. I've been thinking of a new language called "swc" (ie, semantic-web-code). Not sure exactly what form it will take just yet, but the idea is sw is a subset, so all valid sw is valid in swc, but in contrast to sw it will also have multi-line constructs like loops and functions, and other features like input and output and so on. But I don't really have a mental image of what this full language will look like, yet.

Update: another comment on the power of the BKO scheme.
Say we have n different operators, and we step k steps from our starting superposition. Then there are n^k possible daughter superpositions! Now, not all of these will be useful, or distinct, but as a rough measure n^k is large! The more general and powerful your operators, the more useful daughter superpositions you will have available. Cool!

And if we add 1 new general purpose operator, we now have up to (n+1)^k - n^k new daughter superpositions.

Indeed, if we have M starting superpositions, n operators, and we take a max of k steps, then the upper bound on the number of distinct and useful superpositions we have is:
M*(1 + n + n^2 + n^3 + ... + n^(k-1) + n^k) = M*(n^(k+1) - 1)/(n - 1)

Update: we can consider operators to be analogous to tools. Say we are in state:
|hungry> + |have closed can of food>
OK. As it stands, there is not much we can do about that.
But if we learn about a "can-opener" (ie, a tool) and an "eat" operator:
sa: can-opener |have closed can of food> => |have open can of food>
sa: eat |have open can of food> => |not hungry> + |have empty can of food>

And now we can do:
sa: eat can-opener (|hungry> + |have closed can of food>)
|not hungry> + |have empty can of food>

Cool huh! And to me it indicates BKO is the right notation/language to use!
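As a toy model of this, operators can be sketched as maps from kets to superpositions, applied linearly, with undefined rules giving the empty ket |>. (Kets as strings and superpositions as lists is my own encoding here, not the project's.)

```python
# toy operators: dicts mapping an input ket to an output superposition
can_opener = {'have closed can of food': ['have open can of food']}
eat = {'have open can of food': ['not hungry', 'have empty can of food']}

def apply(op, sp):
    out = []
    for ket in sp:
        out.extend(op.get(ket, []))   # undefined rules act like |>
    return out

state = ['hungry', 'have closed can of food']
print(apply(eat, apply(can_opener, state)))
# ['not hungry', 'have empty can of food']
```

Note that |hungry> quietly drops out, since can-opener has no rule for it, matching the console result above.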

Update: here is an interesting little observation: if some of your operators commute, ie "op1 op2 SP" == "op2 op1 SP", then we have fewer distinct daughter superpositions.

brief note on pretty print tables

Just wanted to make a visible note that I tweaked the pretty print table code. We no longer have to do the "extract-value" dance, the table code does that all automatically now. Almost always this is an improvement, though I can think of one case where it might not be. The idea is the column header pretty much indicates the category/data-type, so no need to repeat that for every row in the table.

Old:
sa: load foaf-example-in-sw.sw
sa: table[name,email,works-for] "" |list>
+------------------------+----------------------------------+------------------------+
| name                   | email                            | works-for              |
+------------------------+----------------------------------+------------------------+
| Dan                    | email: danbri@w3.org             | organisation: ILRT     |
| Libby                  | email: libby.miller@bris.ac.uk   | organisation: ILRT     |
| Craig                  | email: craig@netgates.co.uk      | organisation: Netgates |
| Liz                    |                                  | organisation: Netgates |
| Kathleen               |                                  | organisation: Netgates |
| Damian                 |                                  |                        |
| Martin                 | email: m.l.poulter@bristol.ac.uk |                        |
| organisation: ILRT     |                                  |                        |
| organisation: Netgates |                                  |                        |
+------------------------+----------------------------------+------------------------+
New:
sa: load foaf-example-in-sw.sw
sa: table[name,email,works-for] "" |list>
+----------+---------------------------+-----------+
| name     | email                     | works-for |
+----------+---------------------------+-----------+
| Dan      | danbri@w3.org             | ILRT      |
| Libby    | libby.miller@bris.ac.uk   | ILRT      |
| Craig    | craig@netgates.co.uk      | Netgates  |
| Liz      |                           | Netgates  |
| Kathleen |                           | Netgates  |
| Damian   |                           |           |
| Martin   | m.l.poulter@bristol.ac.uk |           |
| ILRT     |                           |           |
| Netgates |                           |           |
+----------+---------------------------+-----------+
Old:
sa: load early-us-presidents.sw
sa: table[name,full-name,president-number,party] "" |early US Presidents: _list>
+------------+---------------------------+------------------+------------------------------+
| name       | full-name                 | president-number | party                        |
+------------+---------------------------+------------------+------------------------------+
| Washington | person: George Washington | number: 1        | party: Independent           |
| Adams      | person: John Adams        | number: 2        | party: Federalist            |
| Jefferson  | person: Thomas Jefferson  | number: 3        | party: Democratic-Republican |
| Madison    | person: James Madison     | number: 4        | party: Democratic-Republican |
| Monroe     | person: James Monroe      | number: 5        | party: Democratic-Republican |
| Q Adams    | person: John Quincy Adams | number: 6        | party: Democratic-Republican |
+------------+---------------------------+------------------+------------------------------+

sa: years-in-office |*> #=> extract-value president-era |_self>
sa: table[name,years-in-office] "" |early US Presidents: _list>
+------------+------------------------------------------------------+
| name       | years-in-office                                      |
+------------+------------------------------------------------------+
| Washington | 1789, 1790, 1791, 1792, 1793, 1794, 1795, 1796, 1797 |
| Adams      | 1797, 1798, 1799, 1800, 1801                         |
| Jefferson  | 1801, 1802, 1803, 1804, 1805, 1806, 1807, 1808, 1809 |
| Madison    | 1809, 1810, 1811, 1812, 1813, 1814, 1815, 1816, 1817 |
| Monroe     | 1817, 1818, 1819, 1820, 1821, 1822, 1823, 1824, 1825 |
| Q Adams    | 1825, 1826, 1827, 1828, 1829                         |
+------------+------------------------------------------------------+
New (NB: we no longer need to define the "years-in-office" operator which applied "extract-value" to the president-era operator):
sa: table[name,full-name,president-number,party] "" |early US Presidents: _list>
+------------+-------------------+------------------+-----------------------+
| name       | full-name         | president-number | party                 |
+------------+-------------------+------------------+-----------------------+
| Washington | George Washington | 1                | Independent           |
| Adams      | John Adams        | 2                | Federalist            |
| Jefferson  | Thomas Jefferson  | 3                | Democratic-Republican |
| Madison    | James Madison     | 4                | Democratic-Republican |
| Monroe     | James Monroe      | 5                | Democratic-Republican |
| Q Adams    | John Quincy Adams | 6                | Democratic-Republican |
+------------+-------------------+------------------+-----------------------+

sa: table[name,president-era] "" |early US Presidents: _list>
+------------+------------------------------------------------------+
| name       | president-era                                        |
+------------+------------------------------------------------------+
| Washington | 1789, 1790, 1791, 1792, 1793, 1794, 1795, 1796, 1797 |
| Adams      | 1797, 1798, 1799, 1800, 1801                         |
| Jefferson  | 1801, 1802, 1803, 1804, 1805, 1806, 1807, 1808, 1809 |
| Madison    | 1809, 1810, 1811, 1812, 1813, 1814, 1815, 1816, 1817 |
| Monroe     | 1817, 1818, 1819, 1820, 1821, 1822, 1823, 1824, 1825 |
| Q Adams    | 1825, 1826, 1827, 1828, 1829                         |
+------------+------------------------------------------------------+
And I guess that's it. The difference between old and improved should be obvious enough.

Update: another tweak of the table code. Now we can dump everything we know about a given superposition without having to manually specify all the operators. Makes use of the "supported-ops" operator.

eg, consider:
sa: load foaf-example-in-sw.sw
sa: supported-ops "" |list>
3.000|op: where-live> + 2.000|op: lives-with> + 4.000|op: email> + 5.000|op: works-for> + |op: wife> + |op: knows-quite-well> + |op: where-lives> + 2.000|op: website> 

sa: load early-us-presidents.sw
sa: supported-ops "" |early US Presidents: _list>
6.000|op: president-number> + 6.000|op: president-era> + 6.000|op: party> + 6.000|op: full-name>
Point is, we can now do this (NB: the "*" instead of an operator label, noting also that * is not a valid operator name):
sa: table[name,*] "" |list>
+----------+--------------+--------------+---------------------------+-----------+------+---------------------------+-------------+----------------------------+
| name     | where-live   | lives-with   | email                     | works-for | wife | knows-quite-well          | where-lives | website                    |
+----------+--------------+--------------+---------------------------+-----------+------+---------------------------+-------------+----------------------------+
| Dan      | Zetland road | Libby, Craig | danbri@w3.org             | ILRT      |      |                           |             |                            |
| Libby    |              |              | libby.miller@bris.ac.uk   | ILRT      |      |                           |             |                            |
| Craig    |              |              | craig@netgates.co.uk      | Netgates  | Liz  |                           |             |                            |
| Liz      | Bristol      | Kathleen     |                           | Netgates  |      |                           |             |                            |
| Kathleen |              |              |                           | Netgates  |      |                           |             |                            |
| Damian   | London       |              |                           |           |      |                           |             |                            |
| Martin   |              |              | m.l.poulter@bristol.ac.uk |           |      | Craig, Damian, Dan, Libby | Bristol     |                            |
| ILRT     |              |              |                           |           |      |                           |             | http://ilrt.org/           |
| Netgates |              |              |                           |           |      |                           |             | http://www.netgates.co.uk/ |
+----------+--------------+--------------+---------------------------+-----------+------+---------------------------+-------------+----------------------------+

sa: table[name,*] "" |early US Presidents: _list>
+------------+------------------+------------------------------------------------------+-----------------------+-------------------+
| name       | president-number | president-era                                        | party                 | full-name         |
+------------+------------------+------------------------------------------------------+-----------------------+-------------------+
| Washington | 1                | 1789, 1790, 1791, 1792, 1793, 1794, 1795, 1796, 1797 | Independent           | George Washington |
| Adams      | 2                | 1797, 1798, 1799, 1800, 1801                         | Federalist            | John Adams        |
| Jefferson  | 3                | 1801, 1802, 1803, 1804, 1805, 1806, 1807, 1808, 1809 | Democratic-Republican | Thomas Jefferson  |
| Madison    | 4                | 1809, 1810, 1811, 1812, 1813, 1814, 1815, 1816, 1817 | Democratic-Republican | James Madison     |
| Monroe     | 5                | 1817, 1818, 1819, 1820, 1821, 1822, 1823, 1824, 1825 | Democratic-Republican | James Monroe      |
| Q Adams    | 6                | 1825, 1826, 1827, 1828, 1829                         | Democratic-Republican | John Quincy Adams |
+------------+------------------+------------------------------------------------------+-----------------------+-------------------+
I suppose I should make a couple of observations.
1) this code highlighted a bug in my foaf-example data-set! Note the "where-lives" instead of "where-live".
2) the "where-live" column shows one example where the auto extract-value on table elements is not ideal. It chomps everything up to the last colon, so eg "UK: Bristol: Zetland road" becomes just "Zetland road". Recall:
where-live |Dan> => |UK: Bristol: Zetland road>
where-live |Liz> => |UK: Bristol>
where-live |Damian> => |UK: London>
I don't think there is much I can do about it though.

I guess that's it. Quite likely I will tweak the table code yet further!

Update: here is a fun, English like, description of supported ops:
sa: load foaf-example-in-sw.sw
sa: list-to-words int-coeffs-to-word extract-value supported-ops "" |list>
|3 where-live, 2 lives-with, 4 email, 5 works-for, 1 wife, 1 knows-quite-well, 1 where-lives and 2 website>
Heh, fun!

new functions: is-greater-than, is-greater-equal-than, etc

I decided a family of comparison functions would be useful, so I added them to my function file and "wired them in" to my processor.

A minimal description (noting there are two variations):
greater-than[t] |x: n> == |x: n> if float(n) > t, else |>
greater-equal-than[t] |x: n> == |x: n> if float(n) >= t, else |>
less-than[t] |x: n> == |x: n> if float(n) < t, else |>
less-equal-than[t] |x: n> == |x: n> if float(n) <= t, else |>
equal[t] |x: n> == |x: n> if (t - epsilon) <= float(n) <= (t + epsilon), else |>
in-range[t1,t2] |x: n> == |x: n> if t1 <= float(n) <= t2, else |>
And the second variation:
is-greater-than[t] |x: n> == |yes> if float(n) > t, else |no>
is-greater-equal-than[t] |x: n> == |yes> if float(n) >= t, else |no>
is-less-than[t] |x: n> == |yes> if float(n) < t, else |no>
is-less-equal-than[t] |x: n> == |yes> if float(n) <= t, else |no>
is-equal[t] |x: n> == |yes> if (t - epsilon) <= float(n) <= (t + epsilon), else |no>
is-in-range[t1,t2] |x: n> == |yes> if t1 <= float(n) <= t2, else |no>
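A minimal Python sketch of the is-* variation above (the epsilon value is my choice, and the ket is modelled as a bare label string, with the number taken from after the last colon):

```python
EPSILON = 0.0001

def ket_float(label):
    # pull the number out of a label like "number: 3" or plain "5.2"
    return float(label.split(':')[-1].strip())

def is_greater_than(t, label):
    return 'yes' if ket_float(label) > t else 'no'

def is_equal(t, label):
    n = ket_float(label)
    return 'yes' if t - EPSILON <= n <= t + EPSILON else 'no'

def is_in_range(t1, t2, label):
    return 'yes' if t1 <= ket_float(label) <= t2 else 'no'

print(is_greater_than(5, 'number: 6'))          # yes
print(is_in_range(330, 700, 'time: 24h: 215'))  # no
```

The remaining family members follow the same pattern with the comparison operator swapped.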
They of course share some similarity with functions I already have, such as drop-below, drop-above, and some of the sigmoids.

Anyway, let's compare the "old" way, without these functions vs the new way:
Old:
is-greater-than-0 |*> #=> do-you-know push-float drop-below[1] pop-float |_self>
is-greater-than-5 |*> #=> do-you-know push-float drop-below[5.01] pop-float |_self>
count-name-matches |*> #=> push-float drop-below[2] pop-float extract-value count id |_self>
is-early |time: 24h: *> #=> do-you-know drop-above[700] drop-below[330] pop-float |_self>
is-late |time: 24h: *> #=> do-you-know (drop-below[2230] pop-float |_self> + drop-above[330] pop-float |_self>)
range-is-early |time: 24h: *> #=> do-you-know drop sigmoid-in-range[330,700] pop-float |_self>
range-is-late |time: 24h: *> #=> do-you-know drop (sigmoid-in-range[2230,2400] pop-float |_self> + sigmoid-in-range[0,330] pop-float |_self>)
is-teenager |person: *> #=> do-you-know drop-below[13] drop-above[19] pop-float age |_self>
is-adult |person: *> #=> do-you-know drop-below[18] pop-float age|_self>
range-is-teenager |person: *> #=> do-you-know drop sigmoid-in-range[13,19] pop-float age |_self> 
New:
improved-is-greater-than-0 |*> #=> is-greater-than[0] |_self>
improved-is-greater-than-5 |*> #=> is-greater-than[5] |_self>
improved-count-name-matches |*> #=> greater-than[1] count id |_self>
improved-range-is-early |time: 24h: *> #=> is-in-range[330,700] |_self>
improved-range-is-late |time: 24h: *> #=> do-you-know (in-range[2230,2400] |_self> + in-range[0,330] |_self>)
improved-is-teenager |person: *> #=> is-in-range[13,19] age |_self>
improved-is-adult |person: *> #=> is-greater-equal-than[18] age |_self>
I guess that is about that. Like I said elsewhere, I'm always in favour of improvements that decrease the amount of work, or dancing, you need to do.

That's it for this post. Heaps more to come!

Update: I guess another point to observe is we can transfer not just data, but active rules when we pass around sw files. This I think will be useful!

Update: decided to implement a variant on these, that applies to the coeff of kets. This I guess is just syntactic sugar, so we don't need "is-equal[100] push-float", we can now do: "is-coeff-equal[100]".
coeff-greater-than[t] n|x: y> == n|x: y> if n > t, else |>
coeff-greater-equal-than[t] n|x: y> == n|x: y> if n >= t, else |>
coeff-less-than[t] n|x: y> == n|x: y> if n < t, else |>
coeff-less-equal-than[t] n|x: y> == n|x: y> if n <= t, else |>
coeff-equal[t] n|x: y> == n|x: y> if (t - epsilon) <= n <= (t + epsilon), else |>
coeff-in-range[t1,t2] n|x: y> == n|x: y> if t1 <= n <= t2, else |>
And the second variation:
is-coeff-greater-than[t] n|x: y> == |yes> if n > t, else |no>
is-coeff-greater-equal-than[t] n|x: y> == |yes> if n >= t, else |no>
is-coeff-less-than[t] n|x: y> == |yes> if n < t, else |no>
is-coeff-less-equal-than[t] n|x: y> == |yes> if n <= t, else |no>
is-coeff-equal[t] n|x: y> == |yes> if (t - epsilon) <= n <= (t + epsilon), else |no>
is-coeff-in-range[t1,t2] n|x: y> == |yes> if t1 <= n <= t2, else |no>
Now I guess the question comes up: how much syntactic sugar is too little, and how much is too much? I guess I lean towards more rather than less.

Tuesday 10 February 2015

XML vs BKO

So today, I want to briefly mention that there is also overlap between XML and my sw/BKO notation. And of course, it being my child, I much prefer my notation! One advantage of my notation is that it is trivial to parse using just grep and sed, let alone python say. Another is that once data is in sw format, you have all the rest of the power of the BKO knowledge engine available.

So, how about an example?
Here is an XML breakfast menu:
<breakfast_menu>
  <food>
    <name>Belgian Waffles</name>
    <price>$5.95</price>
    <description>Two of our famous Belgian Waffles with plenty of real maple syrup</description>
    <calories>650</calories>
  </food>
  <food>
    <name>Strawberry Belgian Waffles</name>
    <price>$7.95</price>
    <description>Light Belgian waffles covered with strawberries and whipped cream</description>
    <calories>900</calories>
  </food>
  <food>
    <name>Berry-Berry Belgian Waffles</name>
    <price>$8.95</price>
    <description>Light Belgian waffles covered with an assortment of fresh berries and whipped cream</description>
    <calories>900</calories>
  </food>
  <food>
    <name>French Toast</name>
    <price>$4.50</price>
    <description>Thick slices made from our homemade sourdough bread</description>
    <calories>600</calories>
  </food>
  <food>
    <name>Homestyle Breakfast</name>
    <price>$6.95</price>
    <description>Two eggs, bacon or sausage, toast, and our ever-popular hash browns</description>
    <calories>950</calories>
  </food>
</breakfast_menu>
And the same thing in BKO (I will admit BKO is probably harder to read than XML for those not familiar with the notation):
|context> => |context: breakfast menu>

|menu: breakfast: list> => |food: Belgian Waffles> + |food: Strawberry Belgian Waffles> + |food: Berry-Berry Belgian Waffles> + |food: French Toast> + |food: Homestyle Breakfast>

name |food: Belgian Waffles> => |text: "Belgian Waffles">
price |food: Belgian Waffles> => |price: 5.95>
description |food: Belgian Waffles> => |text: "Two of our famous Belgian Waffles with plenty of real maple syrup">
calories |food: Belgian Waffles> => |calories: 650>

name |food: Strawberry Belgian Waffles> => |text: "Strawberry Belgian Waffles">
price |food: Strawberry Belgian Waffles> => |price: 7.95>
description |food: Strawberry Belgian Waffles> => |text: "Light Belgian waffles covered with strawberries and whipped cream">
calories |food: Strawberry Belgian Waffles> => |calories: 900>

name |food: Berry-Berry Belgian Waffles> => |text: "Berry-Berry Belgian Waffles">
price |food: Berry-Berry Belgian Waffles> => |price: 8.95>
description |food: Berry-Berry Belgian Waffles> => |text: "Light Belgian waffles covered with an assortment of fresh berries and whipped cream">
calories |food: Berry-Berry Belgian Waffles> => |calories: 900>

name |food: French Toast> => |text: "French Toast">
price |food: French Toast> => |price: 4.50>
description |food: French Toast> => |text: "Thick slices made from our homemade sourdough bread">
calories |food: French Toast> => |calories: 600>

name |food: Homestyle Breakfast> => |text: "Homestyle Breakfast">
price |food: Homestyle Breakfast> => |price: 6.95>
description |food: Homestyle Breakfast> => |text: "Two eggs, bacon or sausage, toast, and our ever-popular hash browns">
calories |food: Homestyle Breakfast> => |calories: 950> 
And now, load her up, and print a table (noting the table code has been updated so it always applies extract-value. I now think this is the way to go, but the idea is currently on probation):
sa: load clean-breakfast-menu.sw
sa: table[food-name,price,calories,description] "" |menu: breakfast: list>
+-----------------------------+-------+----------+---------------------------------------------------------------------------------------+
| food-name                   | price | calories | description                                                                           |
+-----------------------------+-------+----------+---------------------------------------------------------------------------------------+
| Belgian Waffles             | 5.95  | 650      | "Two of our famous Belgian Waffles with plenty of real maple syrup"                   |
| Strawberry Belgian Waffles  | 7.95  | 900      | "Light Belgian waffles covered with strawberries and whipped cream"                   |
| Berry-Berry Belgian Waffles | 8.95  | 900      | "Light Belgian waffles covered with an assortment of fresh berries and whipped cream" |
| French Toast                | 4.50  | 600      | "Thick slices made from our homemade sourdough bread"                                 |
| Homestyle Breakfast         | 6.95  | 950      | "Two eggs, bacon or sausage, toast, and our ever-popular hash browns"                 |
+-----------------------------+-------+----------+---------------------------------------------------------------------------------------+
And of course, we can sort by price:
sa: table[food,price,calories,description] sort-by[price] "" |menu: breakfast: list>
+-----------------------------+-------+----------+---------------------------------------------------------------------------------------+
| food                        | price | calories | description                                                                           |
+-----------------------------+-------+----------+---------------------------------------------------------------------------------------+
| French Toast                | 4.50  | 600      | "Thick slices made from our homemade sourdough bread"                                 |
| Belgian Waffles             | 5.95  | 650      | "Two of our famous Belgian Waffles with plenty of real maple syrup"                   |
| Homestyle Breakfast         | 6.95  | 950      | "Two eggs, bacon or sausage, toast, and our ever-popular hash browns"                 |
| Strawberry Belgian Waffles  | 7.95  | 900      | "Light Belgian waffles covered with strawberries and whipped cream"                   |
| Berry-Berry Belgian Waffles | 8.95  | 900      | "Light Belgian waffles covered with an assortment of fresh berries and whipped cream" |
+-----------------------------+-------+----------+---------------------------------------------------------------------------------------+
And we can sort by calories:
sa: table[food,price,calories,description] sort-by[calories] "" |menu: breakfast: list>
+-----------------------------+-------+----------+---------------------------------------------------------------------------------------+
| food                        | price | calories | description                                                                           |
+-----------------------------+-------+----------+---------------------------------------------------------------------------------------+
| French Toast                | 4.50  | 600      | "Thick slices made from our homemade sourdough bread"                                 |
| Belgian Waffles             | 5.95  | 650      | "Two of our famous Belgian Waffles with plenty of real maple syrup"                   |
| Strawberry Belgian Waffles  | 7.95  | 900      | "Light Belgian waffles covered with strawberries and whipped cream"                   |
| Berry-Berry Belgian Waffles | 8.95  | 900      | "Light Belgian waffles covered with an assortment of fresh berries and whipped cream" |
| Homestyle Breakfast         | 6.95  | 950      | "Two eggs, bacon or sausage, toast, and our ever-popular hash browns"                 |
+-----------------------------+-------+----------+---------------------------------------------------------------------------------------+
And that's it for this post. Heaps more to come!

Update: I don't know. Perhaps someone smarter than I am could write XML => sw and sw => XML transform code?
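For what it's worth, here is a rough proof-of-concept XML => sw sketch using Python's standard library. The tag-to-operator mapping is my guess at a sensible convention, and it doesn't clean up values (eg prices) the way the hand-written sw above does:

```python
import xml.etree.ElementTree as ET

def xml_food_to_sw(xml_text):
    # convert a breakfast-menu style XML document into sw learn rules
    root = ET.fromstring(xml_text)
    lines = []
    for food in root.findall('food'):
        name = food.findtext('name')
        for child in food:
            if child.tag == 'name':
                continue
            lines.append('{} |food: {}> => |{}: {}>'.format(
                child.tag, name, child.tag, child.text))
    return '\n'.join(lines)

menu = """<breakfast_menu>
  <food>
    <name>French Toast</name>
    <price>$4.50</price>
    <calories>600</calories>
  </food>
</breakfast_menu>"""

print(xml_food_to_sw(menu))
# price |food: French Toast> => |price: $4.50>
# calories |food: French Toast> => |calories: 600>
```

The reverse direction, sw => XML, would need a real sw parser, so I'll leave that as the harder half of the exercise.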

Thursday 5 February 2015

another simple one, aiming towards natural language

This is just a simple one, again, inching ever so slowly towards natural language.

-- let's learn some rules:
nothing-like |*> #=> 0 |_self>
a-little-like |*> #=> 0.3 |_self>
like |*> #=> 0.7 |_self>
a-lot-like |*> #=> 0.9 |_self>
exactly-like |*> #=> |_self>

-- learn the taste of an orange:
taste |orange> => |citrus> + |sweet> + |juicy>

-- learn the taste of a banana is nothing like that of an orange:
taste |banana> #=> nothing-like taste |orange>

-- ask the taste of a banana:
sa: taste |banana>
0.000|citrus> + 0.000|sweet> + 0.000|juicy>

-- learn the taste of a mandarin is a lot like that of an orange:
taste |mandarin> #=> a-lot-like taste |orange>

-- ask the taste of a mandarin:
sa: taste |mandarin>
0.900|citrus> + 0.900|sweet> + 0.900|juicy>

Anyway, just some proof of concept, I suppose.
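
In plain Python terms, each of these similarity operators just multiplies every coefficient in the superposition by a fixed weight. A minimal sketch, modelling a superposition as a dict of ket-label -> coefficient (that representation, and the helper name, are mine, not the project's actual classes):

```python
# Similarity operators as coefficient rescaling.
# A superposition is sketched as a dict of ket-label -> coefficient.

def scale(sp, factor):
    """Apply a similarity operator: multiply every coefficient by factor."""
    return {label: coeff * factor for label, coeff in sp.items()}

# taste |orange> => |citrus> + |sweet> + |juicy>  (implicit coefficient 1)
taste_orange = {"citrus": 1.0, "sweet": 1.0, "juicy": 1.0}

taste_banana = scale(taste_orange, 0.0)    # nothing-like
taste_mandarin = scale(taste_orange, 0.9)  # a-lot-like
```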

Next example:
-- learn: "a person is happy if they are whistling".
is-happy |*> #=> if(is-whistling |_self>,|yes>,|>)

-- learn Fred is whistling:
is-whistling |Fred> => |yes>

-- now ask some questions:
sa: do-you-know is-happy |Fred>
|yes>

sa: is-happy |Fred>
|yes>

sa: do-you-know is-happy |Sam>
|no>

sa: is-happy |Sam>
|>
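
The point of the example is that an unsatisfied rule returns the empty ket |>, and do-you-know turns empty vs non-empty into no/yes. A rough Python sketch, using None to stand in for |> (the dict store and function names are illustrative assumptions, not the project's code):

```python
# Sketch of the if / do-you-know behaviour, with None standing in for
# the empty ket |>.

is_whistling = {"Fred": "yes"}  # learned: is-whistling |Fred> => |yes>

def is_happy(name):
    """is-happy |*> #=> if(is-whistling |_self>, |yes>, |>)"""
    return "yes" if is_whistling.get(name) == "yes" else None

def do_you_know(result):
    """|yes> if the result is a non-empty ket, |no> otherwise."""
    return "yes" if result is not None else "no"

do_you_know(is_happy("Fred"))  # -> "yes"
is_happy("Sam")                # -> None, ie the empty ket |>
```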

Update: for lack of a better place to put this one.
-- learn a couple of "is-greater-than" operators:
is-greater-than-0 |*> #=> do-you-know push-float drop-below[1] pop-float |_self>
is-greater-than-5 |*> #=> do-you-know push-float drop-below[5.01] pop-float |_self>

-- now take a look:
sa: table[number,is-greater-than-0,is-greater-than-5] range(|-5>,|10>)
+--------+-------------------+-------------------+
| number | is-greater-than-0 | is-greater-than-5 |
+--------+-------------------+-------------------+
| -5     | no                | no                |
| -4     | no                | no                |
| -3     | no                | no                |
| -2     | no                | no                |
| -1     | no                | no                |
| 0      | no                | no                |
| 1      | yes               | no                |
| 2      | yes               | no                |
| 3      | yes               | no                |
| 4      | yes               | no                |
| 5      | yes               | no                |
| 6      | yes               | yes               |
| 7      | yes               | yes               |
| 8      | yes               | yes               |
| 9      | yes               | yes               |
| 10     | yes               | yes               |
+--------+-------------------+-------------------+
And we can also easily enough count the number greater than 0 in that range:
(makes use of linearity of literal operators, and that kets add)
sa: is-greater-than-0 range(|-5>,|10>)
6.000|no> + 10.000|yes>

And how many are greater than 5:
sa: is-greater-than-5 range(|-5>,|10>)
11.000|no> + 5.000|yes>
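
The chain above moves the numeric ket label into the coefficient, drops anything below the threshold, and do-you-know turns survival into yes/no; then, because kets with the same label add, applying the operator to a range counts the matches. A condensed Python sketch of that semantics (the helper names are mine, and the per-ket logic is collapsed into one comparison):

```python
# Sketch of push-float / drop-below / pop-float / do-you-know, plus the
# counting-by-adding-kets trick. The per-ket pipeline is collapsed into a
# single threshold comparison.
from collections import Counter

def is_greater_than(threshold, n):
    # push-float: numeric label -> coefficient
    # drop-below[threshold]: discard coefficients under the threshold
    # do-you-know: |yes> if something survived, |no> otherwise
    return "yes" if n >= threshold else "no"

# kets with the same label add: range(-5, 10) with drop-below[1]
counts = Counter(is_greater_than(1, n) for n in range(-5, 11))
# counts -> 6 x "no", 10 x "yes", matching 6.000|no> + 10.000|yes>
```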

Update: a table version of "is-happy":
sa: is-happy |*> #=> if(is-whistling |_self>,|yes>,|>)
sa: is-whistling |Fred> => |yes>
sa: do-you-know-is-happy |*> #=> do-you-know is-happy |_self>
sa: table[name,do-you-know-is-happy,is-happy,is-whistling] (|Fred> + |Sam>) 
+------+----------------------+----------+--------------+
| name | do-you-know-is-happy | is-happy | is-whistling |
+------+----------------------+----------+--------------+
| Fred | yes                  | yes      | yes          |
| Sam  | no                   |          |              |
+------+----------------------+----------+--------------+

SPARQL vs BKO

So I've been doing a little bit of reading about dbpedia, since it has a lot of overlap with what I am trying to do! Amongst other things, there are a few SPARQL examples there. For example, "cities with more than 2 million habitants".

First in SPARQL:
(name and population for the top 30 biggest cities)
SELECT ?subject ?population WHERE {
?subject rdf:type <http://dbpedia.org/ontology/City>.
?subject <http://dbpedia.org/ontology/populationUrban> ?population.
FILTER (xsd:integer(?population) > 2000000)
}
ORDER BY DESC(xsd:integer(?population))
LIMIT 30
Then in BKO:
-- load up geonames data for all cities with population >= 15000:
sa: load improved-geonames-cities-15000.sw

-- restrict to cities with population >= 2,000,000, and sort the result:
sa: |result> => coeff-sort drop-below[2000000] population-self relevant-kets[population-self] |>

-- define population-comma operator (tidies our population results a little):
sa: population-comma |*> #=> to-comma-number extract-value population |_self>

-- how many results:
sa: how-many "" |result>
|number: 149>

-- print rank table of top 30 results:
sa: rank-table[id,name,population-comma] select[1,30] "" |result>
+------+-------------+----------------+------------------+
| rank | id          | name           | population-comma |
+------+-------------+----------------+------------------+
| 1    | id: 1796236 | Shanghai       | 22,315,474       |
| 2    | id: 3435910 | Buenos Aires   | 13,076,300       |
| 3    | id: 1275339 | Mumbai         | 12,691,836       |
| 4    | id: 3530597 | Mexico City    | 12,294,193       |
| 5    | id: 1816670 | Beijing        | 11,716,620       |
| 6    | id: 1174872 | Karachi        | 11,624,219       |
| 7    | id: 745044  | Istanbul       | 11,174,257       |
| 8    | id: 1792947 | Tianjin        | 11,090,314       |
| 9    | id: 1809858 | Guangzhou      | 11,071,424       |
| 10   | id: 1273294 | Delhi          | 10,927,986       |
| 11   | id: 1701668 | Manila         | 10,444,527       |
| 12   | id: 524901  | Moscow         | 10,381,222       |
| 13   | id: 1795565 | Shenzhen       | 10,358,381       |
| 14   | id: 1185241 | Dhaka          | 10,356,500       |
| 15   | id: 1835848 | Seoul          | 10,349,312       |
| 16   | id: 3448439 | Sao Paulo      | 10,021,295       |
| 17   | id: 1791247 | Wuhan          | 9,785,388        |
| 18   | id: 2332459 | Lagos          | 9,000,000        |
| 19   | id: 1642911 | Jakarta        | 8,540,121        |
| 20   | id: 1850147 | Tokyo          | 8,336,599        |
| 21   | id: 5128581 | New York City  | 8,175,133        |
| 22   | id: 1812545 | Dongguan       | 8,000,000        |
| 23   | id: 1668341 | Taipei         | 7,871,900        |
| 24   | id: 2314302 | Kinshasa       | 7,785,965        |
| 25   | id: 3936456 | Lima           | 7,737,002        |
| 26   | id: 360630  | Cairo          | 7,734,614        |
| 27   | id: 3688689 | Bogota         | 7,674,366        |
| 28   | id: 2643741 | City of London | 7,556,900        |
| 29   | id: 2643743 | London         | 7,556,900        |
| 30   | id: 1814906 | Chongqing      | 7,457,600        |
+------+-------------+----------------+------------------+
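
Stripped of notation, both the SPARQL and BKO versions are the same pipeline: filter by population, sort descending, take the top N. A plain-Python sketch of that shape (the sample city data here is a made-up stand-in for the geonames sw file):

```python
# filter -> sort -> limit, the shape shared by the SPARQL and BKO queries.
# The sample data is illustrative only, not the geonames dataset.

cities = {
    "Shanghai": 22315474,
    "Buenos Aires": 13076300,
    "Smallville": 15000,
}

def top_cities(data, minimum=2000000, limit=30):
    # drop-below[2000000] / FILTER(?population > 2000000)
    big = {name: pop for name, pop in data.items() if pop >= minimum}
    # coeff-sort / ORDER BY DESC, then select[1,30] / LIMIT 30
    return sorted(big.items(), key=lambda kv: kv[1], reverse=True)[:limit]

top_cities(cities)  # -> [("Shanghai", 22315474), ("Buenos Aires", 13076300)]
```
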
Of mild note: some populations here are wildly different from the dbpedia results. Different ways of counting population (eg, the city centre vs a broader metropolitan definition), I presume.

I guess the other thing about this post is I'm not a big fan of the SPARQL notation. I much prefer my notation, but heh, I'm kinda biased.

The other thing to note is I now have 4 variants of the table operator: table[], strict-table[], rank-table[], and strict-rank-table[]. Also, I tweaked table[] so that it auto-sets all coeffs of the incoming superposition to 1 (using the set-to[1] sigmoid). I figure there are no use cases where we want non-1 coeffs for the incoming superposition. If I'm wrong, I just have to comment out one line of code.

Update: what if we want a table without the "id" column? Well, not super easy. We have to do it indirectly again. Also note that in the geonames data, ids are unique, but names are not. Anyway, our code neatly handles that kind of ambiguity.

Now, using the same |result> as defined above, let's take a quick look at what we know:
sa: select[1,5] "" |result>
22315474.000|id: 1796236> + 13076300.000|id: 3435910> + 12691836.000|id: 1275339> + 12294193.000|id: 3530597> + 11716620.000|id: 1816670>

sa: name select[1,5] "" |result>
22315474.000|Shanghai> + 13076300.000|Buenos Aires> + 12691836.000|Mumbai> + 12294193.000|Mexico City> + 11716620.000|Beijing>

-- now define some operators:
-- define the popn operator (NB: the "id" just before |_self> maps back from "name" to geoname "id"):
popn |*> #=> to-comma-number extract-value population id |_self>

-- define improved popn operator (sorts population results):
improved-popn |*> #=> to-comma-number extract-value population reverse sort-by[population] id |_self>

-- a variant of that is this (makes use of list-to-words):
improved-popn |*> #=> list-to-words to-comma-number extract-value population reverse sort-by[population] id |_self>

-- define the "count-name-matches" operator, which tells us the number of different objects sharing a name:
-- (defined to ignore value if count == 1) 
count-name-matches |*> #=> push-float drop-below[2] pop-float extract-value count id |_self>
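
The idea behind these operators is that one name can map back to several geoname ids, each with its own population. A small Python sketch of that ambiguity handling (Moscow's first id and both populations are taken from the table above; the second id is a made-up placeholder):

```python
# One name -> several ids, each id -> its own population.
# "id: 524901" and the populations come from the table; "id: 999999"
# is a hypothetical placeholder for Moscow's second geonames entry.

name_to_ids = {
    "Moscow": ["id: 524901", "id: 999999"],
    "Shanghai": ["id: 1796236"],
}
population = {"id: 524901": 10381222, "id: 999999": 23800,
              "id: 1796236": 22315474}

def improved_popn(name):
    # id |_self> can return several ids; list their populations descending
    pops = sorted((population[i] for i in name_to_ids[name]), reverse=True)
    return ", ".join("{:,}".format(p) for p in pops)

def count_name_matches(name):
    # only report the count when the name is ambiguous (count >= 2)
    n = len(name_to_ids[name])
    return n if n >= 2 else None
```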

-- show the resulting table:
sa: rank-table[city,improved-popn,count-name-matches] name "" |result>
+------+----------------------+------------------------------------+--------------------+
| rank | city                 | improved-popn                      | count-name-matches |
+------+----------------------+------------------------------------+--------------------+
| 1    | Shanghai             | 22,315,474                         |                    |
| 2    | Buenos Aires         | 13,076,300                         |                    |
| 3    | Mumbai               | 12,691,836                         |                    |
| 4    | Mexico City          | 12,294,193                         |                    |
| 5    | Beijing              | 11,716,620                         |                    |
| 6    | Karachi              | 11,624,219                         |                    |
| 7    | Istanbul             | 11,174,257                         |                    |
| 8    | Tianjin              | 11,090,314                         |                    |
| 9    | Guangzhou            | 11,071,424                         |                    |
| 10   | Delhi                | 10,927,986                         |                    |
| 11   | Manila               | 10,444,527                         |                    |
| 12   | Moscow               | 10,381,222, 23,800                 | 2                  |
| 13   | Shenzhen             | 10,358,381                         |                    |
| 14   | Dhaka                | 10,356,500, 36,111                 | 2                  |
| 15   | Seoul                | 10,349,312                         |                    |
| 16   | Sao Paulo            | 10,021,295                         |                    |
| 17   | Wuhan                | 9,785,388                          |                    |
| 18   | Lagos                | 9,000,000, 18,831                  | 2                  |
| 19   | Jakarta              | 8,540,121                          |                    |
| 20   | Tokyo                | 8,336,599                          |                    |
| 21   | New York City        | 8,175,133                          |                    |
| 22   | Dongguan             | 8,000,000                          |                    |
| 23   | Taipei               | 7,871,900                          |                    |
| 24   | Kinshasa             | 7,785,965                          |                    |
| 25   | Lima                 | 7,737,002, 38,771                  | 2                  |
| 26   | Cairo                | 7,734,614                          |                    |
| 27   | Bogota               | 7,674,366                          |                    |
| 28   | City of London       | 7,556,900                          |                    |
| 29   | London               | 7,556,900, 346,765                 | 2                  |
| 30   | Chongqing            | 7,457,600                          |                    |
| 31   | Chengdu              | 7,415,590                          |                    |
| 32   | Nanjing              | 7,165,292                          |                    |
| 33   | Tehran               | 7,153,309                          |                    |
| 34   | Nanchong             | 7,150,000                          |                    |
| 35   | Hong Kong            | 7,012,738                          |                    |
| 36   | Xi'an                | 6,501,190                          |                    |
| 37   | Lahore               | 6,310,888                          |                    |
| 38   | Shenyang             | 6,255,921                          |                    |
| 39   | Hangzhou             | 6,241,971                          |                    |
| 40   | Rio de Janeiro       | 6,023,699                          |                    |
| 41   | Harbin               | 5,878,939                          |                    |
| 42   | Baghdad              | 5,672,513                          |                    |
| 43   | Tai'an               | 5,499,000                          |                    |
| 44   | Suzhou               | 5,345,961, 205,130                 | 2                  |
| 45   | Shantou              | 5,329,024                          |                    |
| 46   | Bangkok              | 5,104,476                          |                    |
| 47   | Bangalore            | 5,104,047                          |                    |
| 48   | Saint Petersburg     | 5,028,000, 244,769                 | 2                  |
| 49   | Santiago             | 4,837,295, 108,414, 46,611, 35,968 | 4                  |
| 50   | Kolkata              | 4,631,392                          |                    |
| 51   | Sydney               | 4,627,345, 105,968                 | 2                  |
| 52   | Toronto              | 4,612,191                          |                    |
| 53   | Yangon               | 4,477,638                          |                    |
| 54   | Jinan                | 4,335,989                          |                    |
| 55   | Chennai              | 4,328,063                          |                    |
| 56   | Zhengzhou            | 4,253,913                          |                    |
| 57   | Melbourne            | 4,246,375, 76,068                  | 2                  |
| 58   | Riyadh               | 4,205,961                          |                    |
| 59   | Changchun            | 4,193,073                          |                    |
| 60   | Dalian               | 4,087,733                          |                    |
| 61   | Chittagong           | 3,920,222                          |                    |
| 62   | Kunming              | 3,855,346                          |                    |
| 63   | Alexandria           | 3,811,516, 139,966, 49,346, 47,723 | 4                  |
| 64   | Los Angeles          | 3,792,621, 125,430                 | 2                  |
| 65   | Ahmedabad            | 3,719,710                          |                    |
| 66   | Qingdao              | 3,718,835                          |                    |
| 67   | Busan                | 3,678,555                          |                    |
| 68   | Abidjan              | 3,677,115                          |                    |
| 69   | Kano                 | 3,626,068                          |                    |
| 70   | Foshan               | 3,600,000                          |                    |
| 71   | Hyderabad            | 3,597,816, 1,386,330               | 2                  |
| 72   | Puyang               | 3,590,000                          |                    |
| 73   | Yokohama             | 3,574,443                          |                    |
| 74   | Ibadan               | 3,565,108                          |                    |
| 75   | Singapore            | 3,547,809                          |                    |
| 76   | Wuxi                 | 3,543,719, 66,442                  | 2                  |
| 77   | Xiamen               | 3,531,347                          |                    |
| 78   | Ankara               | 3,517,182                          |                    |
| 79   | Tianshui             | 3,500,000                          |                    |
| 80   | Ningbo               | 3,491,597                          |                    |
| 81   | Ho Chi Minh City     | 3,467,331                          |                    |
| 82   | Shiyan               | 3,460,000, 408,055                 | 2                  |
| 83   | Cape Town            | 3,433,441                          |                    |
| 84   | Taiyuan              | 3,426,519                          |                    |
| 85   | Berlin               | 3,426,354                          |                    |
| 86   | Tangshan             | 3,372,102                          |                    |
| 87   | Hefei                | 3,310,268                          |                    |
| 88   | Montreal             | 3,268,513                          |                    |
| 89   | Madrid               | 3,255,944, 50,437                  | 2                  |
| 90   | Pyongyang            | 3,222,000                          |                    |
| 91   | Casablanca           | 3,144,909                          |                    |
| 92   | Zibo                 | 3,129,228                          |                    |
| 93   | Zhongshan            | 3,121,275                          |                    |
| 94   | Durban               | 3,120,282                          |                    |
| 95   | Changsha             | 3,093,980                          |                    |
| 96   | Kabul                | 3,043,532                          |                    |
| 97   | UEruemqi             | 3,029,372                          |                    |
| 98   | Caracas              | 3,000,000                          |                    |
| 99   | Pune                 | 2,935,744                          |                    |
| 100  | Surat                | 2,894,504                          |                    |
| 101  | Jeddah               | 2,867,446                          |                    |
| 102  | Shijiazhuang         | 2,834,942                          |                    |
| 103  | Kanpur               | 2,823,249                          |                    |
| 104  | Kiev                 | 2,797,553                          |                    |
| 105  | Luanda               | 2,776,168                          |                    |
| 106  | Quezon City          | 2,761,720                          |                    |
| 107  | Addis Ababa          | 2,757,729                          |                    |
| 108  | Nairobi              | 2,750,547                          |                    |
| 109  | Salvador             | 2,711,840                          |                    |
| 110  | Jaipur               | 2,711,758                          |                    |
| 111  | Dar es Salaam        | 2,698,652                          |                    |
| 112  | Chicago              | 2,695,598                          |                    |
| 113  | Lanzhou              | 2,628,426                          |                    |
| 114  | Incheon              | 2,628,000                          |                    |
| 115  | Yunfu                | 2,612,800                          |                    |
| 116  | Navi Mumbai          | 2,600,000                          |                    |
| 117  | Al Basrah            | 2,600,000                          |                    |
| 118  | Osaka-shi            | 2,592,413                          |                    |
| 119  | Mogadishu            | 2,587,183                          |                    |
| 120  | Daegu                | 2,566,540                          |                    |
| 121  | Faisalabad           | 2,506,595                          |                    |
| 122  | Izmir                | 2,500,603                          |                    |
| 123  | Dakar                | 2,476,400                          |                    |
| 124  | Lucknow              | 2,472,011                          |                    |
| 125  | Al Jizah             | 2,443,203                          |                    |
| 126  | Fortaleza            | 2,400,000                          |                    |
| 127  | Cali                 | 2,392,877                          |                    |
| 128  | Surabaya             | 2,374,658                          |                    |
| 129  | Belo Horizonte       | 2,373,224                          |                    |
| 130  | Nanchang             | 2,357,839                          |                    |
| 131  | Grand Dakar          | 2,352,057                          |                    |
| 132  | Rome                 | 2,318,895, 36,303, 33,725          | 3                  |
| 133  | Mashhad              | 2,307,177                          |                    |
| 134  | Brooklyn             | 2,300,664                          |                    |
| 135  | Borough of Queens    | 2,272,771                          |                    |
| 136  | Nagpur               | 2,228,018                          |                    |
| 137  | Maracaibo            | 2,225,000                          |                    |
| 138  | Brasilia             | 2,207,718                          |                    |
| 139  | Santo Domingo        | 2,201,941, 45,476                  | 2                  |
| 140  | Nagoya-shi           | 2,191,279                          |                    |
| 141  | Brisbane             | 2,189,878                          |                    |
| 142  | Havana               | 2,163,824                          |                    |
| 143  | Paris                | 2,138,551, 25,171                  | 2                  |
| 144  | Houston              | 2,099,451                          |                    |
| 145  | Al Mawsil al Jadidah | 2,065,597                          |                    |
| 146  | Johannesburg         | 2,026,469                          |                    |
| 147  | Kowloon              | 2,019,533                          |                    |
| 148  | Al Basrat al Qadimah | 2,015,483                          |                    |
| 149  | Almaty               | 2,000,900                          |                    |
+------+----------------------+------------------------------------+--------------------+
And that's it for now. Heaps more to come, as usual.

Tuesday 3 February 2015

African capital cities and population in table format

Bah! Only got part-way through this one, but I'll post what I have. I wanted to load up some data to pretty print a table of African countries and their capital cities. So I had that working, and then thought why not population too? This time from wikipedia. Problem is the two data-sets have some differences in naming countries (yeah, I guess this is going to be a problem in general for the semantic web). Then I got greedy and thought why not a more comprehensive data set for Africa. Looked again at wikipedia, gave it some thought, and changed my mind. Too much work. Almost need to write a custom parser. Too lazy for that. So I gave up. Here is what I do have:
sa: load africa.sw
sa: table[name,capital-city,population] "" |Africa: country: list>
+---------------------------------------+--------------------+-----------------------+
| name                                  | capital-city       | population            |
+---------------------------------------+--------------------+-----------------------+
| country: Algeria                      | city: Algiers      | population: 39903000  |
| country: Angola                       | city: Luanda       | population: 25326000  |
| country: Benin                        | city: Porto-Novo   | population: 10750000  |
| country: Botswana                     | city: Gaborone     | population: 2176000   |
| country: Burkina Faso                 | city: Ouagadougou  | population: 18477000  |
| country: Burundi                      | city: Bujumbura    | population: 9824000   |
| country: Cameroon                     | city: Yaounde      | population: 21918000  |
| country: Cape Verde                   | city: Praia        | population: 525000    |
| country: Central African Republic     | city: Bangui       | population: 5545000   |
| country: Chad                         | city: N'Djamena    | population: 13675000  |
| country: Cote d'Ivoire                | city: Yamoussoukro |                       |
| country: Democratic Republic of Congo | city: Kinshasa     |                       |
| country: Egypt                        | city: Cairo        | population: 88523000  |
| country: Equatorial Guinea            | city: Malabo       | population: 1996000   |
| country: Eritrea                      | city: Asmara       | population: 6895000   |
| country: Ethiopia                     | city: AddisAbaba   | population: 90076000  |
| country: Gabon                        | city: Libreville   | population: 2382000   |
| country: Ghana                        | city: Accra        | population: 27714000  |
| country: Kenya                        | city: Nairobi      | population: 44153000  |
| country: Lesotho                      | city: Maseru       | population: 1908000   |
| country: Liberia                      | city: Monrovia     | population: 4046000   |
| country: Libya                        | city: Tripoli      | population: 6521000   |
| country: Madagascar                   | city: Antananarivo | population: 23053000  |
| country: Malawi                       | city: Lilongwe     | population: 16307000  |
| country: Mali                         | city: Bamako       | population: 17796000  |
| country: Mauritania                   | city: Nouakchott   | population: 3632000   |
| country: Mauritius                    | city: Port Louis   | population: 1263000   |
| country: Morocco                      | city: Rabat        | population: 33656000  |
| country: Mozambique                   | city: Maputo       | population: 25728000  |
| country: Niger                        | city: Niamey       | population: 18880000  |
| country: Nigeria                      | city: Abuja        | population: 185043000 |
| country: Republic of Congo            | city: Brazzaville  |                       |
| country: Republic of Djibouti         | city: Djibouti     |                       |
| country: Republic of Guinea           | city: Conakry      |                       |
| country: Republic of Namibia          | city: Windhoek     |                       |
| country: Republic of South Sudan      | city: Juba         |                       |
| country: Republic of Sudan            | city: Khartoum     |                       |
| country: Republic of Tunisia          | city: Tunis        |                       |
| country: Rwanda                       | city: Kigali       | population: 11324000  |
| country: Sao Tome and Principe        | city: Sao Tome     |                       |
| country: Senegal                      | city: Dakar        | population: 14150000  |
| country: Seychelles                   | city: Victoria     | population: 97000     |
| country: Sierra Leone                 | city: Freetown     | population: 6513000   |
| country: Somalia                      | city: Mogadishu    | population: 10972000  |
| country: South Africa                 | city: Pretoria     | population: 54844000  |
| country: Swaziland                    | city: Mbabane      | population: 1097000   |
| country: Tanzania                     | city: Dodoma       | population: 48829000  |
| country: The Gambia                   | city: Banjul       |                       |
| country: Togo                         | city: Lome         | population: 7065000   |
| country: Uganda                       | city: Kampala      | population: 35760000  |
| country: Union of Comoros             | city: Moroni       |                       |
| country: Western Sahara               | city: El Aaiun     |                       |
| country: Zambia                       | city: Lusaka       | population: 15474000  |
| country: Zimbabwe                     | city: Harare       | population: 13503000  |
+---------------------------------------+--------------------+-----------------------+
And I guess that is it for now.

Update: if we don't want "country: ", "city: " and "population: " in there, we have to do a dance. At first guess you might (I certainly did) try "extract-value" applied to the incoming superposition, to strip the "country: " prefix. But here is what happens (top 5 results for brevity's sake):
sa: table[country,capital-city,population] extract-value select[1,5] "" |Africa: country: list>
+--------------+--------------+------------+
| country      | capital-city | population |
+--------------+--------------+------------+
| Algeria      |              |            |
| Angola       |              |            |
| Benin        |              |            |
| Botswana     |              |            |
| Burkina Faso |              |            |
+--------------+--------------+------------+
So why did this happen? Because we have:
capital-city |country: Algeria> => |city: Algiers>
population |country: Algeria> => |population: 39903000>
but we have no knowledge of "capital-city" and "population" applied to kets without the "country: " data-type. ie, these are undefined:
capital-city |Algeria> => ...
population |Algeria> => ...
So this is where the dancing kicks in (NB: the merge-labels() that re-inserts the "country: " data-type):
capital |*> #=> extract-value capital-city merge-labels(|country: > + |_self>)
popn |*> #=> to-comma-number extract-value population merge-labels(|country: > + |_self>)
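
In Python terms, the rules are stored keyed by the full "country: X" label, so once the prefix is stripped the lookup fails, and merge-labels() simply re-inserts it. A minimal sketch of that dance (the dict store and helper names are illustrative, not the project's implementation):

```python
# Rules are keyed by the full "country: X" label, so after extract-value
# strips the prefix, merge-labels must re-insert it before the lookup.

capital_city = {"country: Algeria": "city: Algiers"}

def extract_value(label):
    """Strip the 'category: ' prefix: 'city: Algiers' -> 'Algiers'."""
    return label.split(": ")[-1]

def capital(name):
    # capital |*> #=> extract-value capital-city merge-labels(|country: > + |_self>)
    full = "country: " + name          # merge-labels re-inserts the prefix
    return extract_value(capital_city[full])

capital_city.get("Algeria")  # -> None: no rule for the bare label
capital("Algeria")           # -> "Algiers"
```
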
And now the tidied table:
sa: table[country,capital,popn] extract-value "" |Africa: country: list>
+------------------------------+--------------+-------------+
| country                      | capital      | popn        |
+------------------------------+--------------+-------------+
| Algeria                      | Algiers      | 39,903,000  |
| Angola                       | Luanda       | 25,326,000  |
| Benin                        | Porto-Novo   | 10,750,000  |
| Botswana                     | Gaborone     | 2,176,000   |
| Burkina Faso                 | Ouagadougou  | 18,477,000  |
| Burundi                      | Bujumbura    | 9,824,000   |
| Cameroon                     | Yaounde      | 21,918,000  |
| Cape Verde                   | Praia        | 525,000     |
| Central African Republic     | Bangui       | 5,545,000   |
| Chad                         | N'Djamena    | 13,675,000  |
| Cote d'Ivoire                | Yamoussoukro |             |
| Democratic Republic of Congo | Kinshasa     |             |
| Egypt                        | Cairo        | 88,523,000  |
| Equatorial Guinea            | Malabo       | 1,996,000   |
| Eritrea                      | Asmara       | 6,895,000   |
| Ethiopia                     | AddisAbaba   | 90,076,000  |
| Gabon                        | Libreville   | 2,382,000   |
| Ghana                        | Accra        | 27,714,000  |
| Kenya                        | Nairobi      | 44,153,000  |
| Lesotho                      | Maseru       | 1,908,000   |
| Liberia                      | Monrovia     | 4,046,000   |
| Libya                        | Tripoli      | 6,521,000   |
| Madagascar                   | Antananarivo | 23,053,000  |
| Malawi                       | Lilongwe     | 16,307,000  |
| Mali                         | Bamako       | 17,796,000  |
| Mauritania                   | Nouakchott   | 3,632,000   |
| Mauritius                    | Port Louis   | 1,263,000   |
| Morocco                      | Rabat        | 33,656,000  |
| Mozambique                   | Maputo       | 25,728,000  |
| Niger                        | Niamey       | 18,880,000  |
| Nigeria                      | Abuja        | 185,043,000 |
| Republic of Congo            | Brazzaville  |             |
| Republic of Djibouti         | Djibouti     |             |
| Republic of Guinea           | Conakry      |             |
| Republic of Namibia          | Windhoek     |             |
| Republic of South Sudan      | Juba         |             |
| Republic of Sudan            | Khartoum     |             |
| Republic of Tunisia          | Tunis        |             |
| Rwanda                       | Kigali       | 11,324,000  |
| Sao Tome and Principe        | Sao Tome     |             |
| Senegal                      | Dakar        | 14,150,000  |
| Seychelles                   | Victoria     | 97,000      |
| Sierra Leone                 | Freetown     | 6,513,000   |
| Somalia                      | Mogadishu    | 10,972,000  |
| South Africa                 | Pretoria     | 54,844,000  |
| Swaziland                    | Mbabane      | 1,097,000   |
| Tanzania                     | Dodoma       | 48,829,000  |
| The Gambia                   | Banjul       |             |
| Togo                         | Lome         | 7,065,000   |
| Uganda                       | Kampala      | 35,760,000  |
| Union of Comoros             | Moroni       |             |
| Western Sahara               | El Aaiun     |             |
| Zambia                       | Lusaka       | 15,474,000  |
| Zimbabwe                     | Harare       | 13,503,000  |
+------------------------------+--------------+-------------+
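The idea behind extract-value is simple: it strips the category prefix from a ket label, keeping only the text after the final ": ". A minimal Python sketch of that behaviour, assuming labels are plain strings (the function name and examples here are illustrative, not the project's actual implementation):

```python
def extract_value(label):
    # keep only the text after the last ": " separator;
    # labels without a category prefix pass through unchanged
    return label.rsplit(": ", 1)[-1]

print(extract_value("country: Nigeria"))  # -> Nigeria
print(extract_value("city: Abuja"))       # -> Abuja
print(extract_value("Abuja"))             # -> Abuja
```

Applied linearly across a superposition, this is what turns "country: Nigeria" into the bare "Nigeria" seen in the table above.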
And one more for luck! The top 10 countries in Africa sorted by population, in a rank table, with the "country: ", "city: ", and "population: " prefixes removed:
sa: rank-table[country,capital,popn] select[1,10] extract-value reverse sort-by[population] "" |Africa: country: list> 
+------+--------------+------------+-------------+
| rank | country      | capital    | popn        |
+------+--------------+------------+-------------+
| 1    | Nigeria      | Abuja      | 185,043,000 |
| 2    | Ethiopia     | AddisAbaba | 90,076,000  |
| 3    | Egypt        | Cairo      | 88,523,000  |
| 4    | South Africa | Pretoria   | 54,844,000  |
| 5    | Tanzania     | Dodoma     | 48,829,000  |
| 6    | Kenya        | Nairobi    | 44,153,000  |
| 7    | Algeria      | Algiers    | 39,903,000  |
| 8    | Uganda       | Kampala    | 35,760,000  |
| 9    | Morocco      | Rabat      | 33,656,000  |
| 10   | Ghana        | Accra      | 27,714,000  |
+------+--------------+------------+-------------+
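The pipeline reads right to left: sort-by[population] sorts ascending, reverse flips to descending, select[1,10] keeps the first ten, and rank-table numbers them. A rough plain-Python analogue, using a small hypothetical country -> population dict in place of the sw file:

```python
# hypothetical stand-in data for africa.sw
populations = {
    "Nigeria": 185043000,
    "Ethiopia": 90076000,
    "Egypt": 88523000,
    "Seychelles": 97000,
}

# sort-by[population] then reverse: descending sort by population;
# select[1,10]: keep the top 10 entries
top = sorted(populations.items(), key=lambda kv: kv[1], reverse=True)[:10]

# rank-table: prepend a 1-based rank column
for rank, (country, popn) in enumerate(top, start=1):
    print(rank, country, f"{popn:,}")
```

This is only an analogy for the data flow, not how the table code is actually implemented.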
And I think that is it for this post. I'm slowly learning the best way to interact with tables.

Update: the table code now auto-applies the "extract-value" operator to table elements, so we no longer need to "do a dance" to remove category text cluttering up our tables. Now it's simply:
sa: load africa.sw
sa: popn |*> #=> to-comma-number population |_self>
sa: rank-table[country,capital-city,popn] select[1,10] reverse sort-by[population] "" |Africa: country: list>
+------+--------------+--------------+-------------+
| rank | country      | capital-city | popn        |
+------+--------------+--------------+-------------+
| 1    | Nigeria      | Abuja        | 185,043,000 |
| 2    | Ethiopia     | AddisAbaba   | 90,076,000  |
| 3    | Egypt        | Cairo        | 88,523,000  |
| 4    | South Africa | Pretoria     | 54,844,000  |
| 5    | Tanzania     | Dodoma       | 48,829,000  |
| 6    | Kenya        | Nairobi      | 44,153,000  |
| 7    | Algeria      | Algiers      | 39,903,000  |
| 8    | Uganda       | Kampala      | 35,760,000  |
| 9    | Morocco      | Rabat        | 33,656,000  |
| 10   | Ghana        | Accra        | 27,714,000  |
+------+--------------+--------------+-------------+
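The to-comma-number operator used in the popn definition above just inserts thousands separators into a number. A one-line Python sketch, assuming the ket label is an integer string (the function name here is illustrative):

```python
def to_comma_number(label):
    # format an integer string with comma thousands separators
    return f"{int(label):,}"

print(to_comma_number("185043000"))  # -> 185,043,000
```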
I'm always in favour of minimising how much work you need to do, so I think this is a big improvement! No dance needed just to tidy up the table a little.