Friday 27 March 2015

mapping sw files to frequency lists

So the idea is: we have a big, and rapidly growing, collection of sw files, so maybe we can use find-topic[op] to find which sw file we need. I guess it is a kind of sorted, fuzzy grep. All we need to do is map the files to frequency lists, and we are done. Here is the code for that, and here is the resulting sw file. Note that for speed we currently only process sw files that are under 10 KB in size. Maybe we will relax that at some point.
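The linked code isn't reproduced here, but the gist is just to count how often each ket appears in each sw file, which presumably then gets learned as the kets op for the corresponding |sw file: ...> ket. A rough regex-based sketch of my own (not the actual linked code, and ignoring the coefficients on kets):
import re
from collections import Counter

# rough sketch: treat every |...> term in an sw file as a ket, and count them
def sw_file_to_frequency_list(filename):
    with open(filename) as f:
        text = f.read()
    return Counter(re.findall(r'\|([^>\n]*)>', text))

# eg, frequency_list["animal: frog"] would then be the count of |animal: frog> in that file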

Now in the console:
-- load the data:
sa: load sw-files-to-frequency-lists.sw

-- save some typing:
sa: h |*> #=> find-topic[kets] |_self>

-- find sw files that mention frogs:
sa: h |animal: frog>
75.0|sw file: frog.sw> + 25|sw file: spell-active-buffer.sw>

-- find sw files that mention George:
sa: h |george>
66.667|sw file: similar-chars-example.sw> + 33.333|sw file: matrix-as-network.sw>

sa: h |person: George>
27.273|sw file: blog-george.sw> + 27.273|sw file: george.sw> + 27.273|sw file: new-george.sw> + 18.182|sw file: recall-general-rules-example.sw>

-- find sw files that mention bot: Emma:
sa: h |bot: Emma>
50|sw file: bot-emma.sw> + 50|sw file: bots.sw>

-- find sw files that use |op: fib>:
sa: h |op: fib>
16.667|sw file: active-fib-play.sw> + 16.667|sw file: fib-play.sw> + 16.667|sw file: fibonacci.sw> + 16.667|sw file: memoizing-fibonacci.sw> + 16.667|sw file: next-fib-play.sw> + 16.667|sw file: small-fib.sw>

-- find sw files that mention |food: Belgian Waffles>:
sa: h |food: Belgian Waffles>
31.746|sw file: clean-breakfast-menu.sw> + 25.397|sw file: breakfast-menu.sw> + 23.81|sw file: breaky-presidents.sw> + 19.048|sw file: next-breakfast-menu.sw>

-- find sw files that mention |word: btw>:
sa: h |word: btw>
100|sw file: internet-acronyms.sw>

-- find sw files that mention |document: www proposal>
sa: h |document: www proposal>
100|sw file: www-proposal.sw>
So I guess it works as expected. Not sure it is that much of an improvement over grep. Yeah, we do have coeffs in there showing which sw file is more relevant, but still.

I guess also if we map source code to frequency lists, we could use this to search for the right code file. But like I just said, not sure it is that much better than grep. Heh, maybe a little worse in some cases.

That is it for this post. Another find-topic[op] example in the next post.

Update: so there is room for improvement here. One idea is to borrow Google's "did you mean" feature. Currently you have to get the ket exactly right (including capitalization) or else you get nothing. With some work we should be able to use a fuzzier ket search. The key component of that would of course be our friend similar[op].

Wednesday 25 March 2015

find-unique[op] applied to webpage superpositions

Recall back here we mapped sample websites to superpositions, and then did pattern recognition on them. Well, using find-unique[op] we can give a vastly better result!
-- load up the data:
sa: load improved-fragment-webpages.sw
sa: load create-average-website-fragments.sw
Here is the tweaked BKO that makes use of find-unique[op]:
(this is inside this file)
-- define the list of average websites:
|ave list> => |average abc> + |average adelaidenow> + |average slashdot> + |average smh> + |average wikipedia> + |average youtube>

-- we want average hash to be distinct from the other hashes:
|null> => map[hash-4B,average-hash-4B] "" |ave list>

-- find unique kets for our average superpositions:
|null> => find-unique[average-hash-4B] |>

-- now, let's see how well these patterns recognize the pages we left out of our average:
result |abc 11> => 100 similar[hash-4B,unique-average-hash-4B] |abc 11>
result |adelaidenow 11> => 100 similar[hash-4B,unique-average-hash-4B] |adelaidenow 11>
result |slashdot 11> => 100 similar[hash-4B,unique-average-hash-4B] |slashdot 11>
result |smh 11> => 100 similar[hash-4B,unique-average-hash-4B] |smh 11>
result |wikipedia 11> => 100 similar[hash-4B,unique-average-hash-4B] |wikipedia 11>
result |youtube 11> => 100 similar[hash-4B,unique-average-hash-4B] |youtube 11>

-- tidy results:
tidy-result |abc 11> => normalize[100] result |_self>
tidy-result |adelaidenow 11> => normalize[100] result |_self>
tidy-result |slashdot 11> => normalize[100] result |_self>
tidy-result |smh 11> => normalize[100] result |_self>
tidy-result |wikipedia 11> => normalize[100] result |_self>
tidy-result |youtube 11> => normalize[100] result |_self>
And here are the results:
sa: matrix[result]
[ average abc         ] = [  36.0  0      0      0      0      0      ] [ abc 11         ]
[ average adelaidenow ]   [  0     38.66  0      0      0      0      ] [ adelaidenow 11 ]
[ average slashdot    ]   [  0     0      35.48  0.04   0      0      ] [ slashdot 11    ]
[ average smh         ]   [  0     0.02   0.02   36.99  0      0      ] [ smh 11         ]
[ average wikipedia   ]   [  0     0.01   0.03   0      36.54  0      ] [ wikipedia 11   ]
[ average youtube     ]   [  0     0.02   0      0      0      36.72  ] [ youtube 11     ]

sa: matrix[tidy-result]
[ average abc         ] = [  100  0      0      0     0    0    ] [ abc 11         ]
[ average adelaidenow ]   [  0    99.87  0      0     0    0    ] [ adelaidenow 11 ]
[ average slashdot    ]   [  0    0      99.86  0.1   0    0    ] [ slashdot 11    ]
[ average smh         ]   [  0    0.05   0.05   99.9  0    0    ] [ smh 11         ]
[ average wikipedia   ]   [  0    0.03   0.1    0     100  0    ] [ wikipedia 11   ]
[ average youtube     ]   [  0    0.05   0      0     0    100  ] [ youtube 11     ]
Some notes:
1) that is a seriously good result! Yeah, without the normalize[100] we are down from 90% to 36% similarity, but the gap between best result and next best is now rather large! Which we see clearly in the tidy-result matrix (that does make use of normalize[100]). Heh, and we don't need drop-below[t] anymore either!
2) it is interesting that we can get such a big improvement using only 1 new line of code (the find-unique[average-hash-4B] bit) and a few tweaks to the existing BKO.
3) this technique of dropping back to only the unique kets works just some of the time. For a start you need large superpositions, and plenty of kets that differ from superposition to superposition. For example, this technique would not work for the Iris example, the wage prediction example, or the document-type example. I'm wondering if there is a way to borrow the general idea of suppressing duplicate kets, but one not as harsh as keeping only the unique ones. Maybe as simple as: if a ket is in n superpositions, map coeff => coeff/n? Or do we need something smarter than that? (there is a quick sketch of the coeff/n idea after these notes)
4) lets take a look at how many unique kets we have:
sa: how-many-hash |*> #=> to-comma-number how-many average-hash-4B |_self>
sa: how-many-unique-hash |*> #=> to-comma-number how-many unique-average-hash-4B |_self>
sa: delta |*> #=> arithmetic(how-many average-hash-4B |_self>,|->,how-many unique-average-hash-4B |_self>)
sa: table[website,how-many-hash,how-many-unique-hash,delta] "" |ave list>
+---------------------+---------------+----------------------+-------+
| website             | how-many-hash | how-many-unique-hash | delta |
+---------------------+---------------+----------------------+-------+
| average abc         | 1,492         | 1,391                | 101   |
| average adelaidenow | 11,869        | 11,636               | 233   |
| average slashdot    | 5,462         | 5,275                | 187   |
| average smh         | 10,081        | 9,784                | 297   |
| average wikipedia   | 3,182         | 3,084                | 98    |
| average youtube     | 6,390         | 6,310                | 80    |
+---------------------+---------------+----------------------+-------+
I didn't really expect that. I thought there would be a lot more duplicate kets, but instead we only see a couple of hundred. But since removing them improved our results so much, presumably the duplicate kets had relatively large coeffs. eg, the ket generated from </a> in html will be common to all of our webpages, and have a large coeff.
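Following up note 3 above, here is a rough sketch of the coeff => coeff/n idea, with plain python dicts standing in for superpositions (hypothetical names, not project code):
from collections import Counter

def inverse_count_scale(superpositions):
    # count how many superpositions each ket appears in
    counts = Counter(ket for sp in superpositions.values() for ket in sp)
    # scale each coefficient by 1/n, where n is that count
    return {name: {ket: coeff / counts[ket] for ket, coeff in sp.items()}
            for name, sp in superpositions.items()}

# toy example: the </a> ket appears in both pages, so its coeff is halved
pages = {"page-1": {"</a>": 50, "frog": 3}, "page-2": {"</a>": 40, "cat": 7}}
print(inverse_count_scale(pages))
# {'page-1': {'</a>': 25.0, 'frog': 3.0}, 'page-2': {'</a>': 20.0, 'cat': 7.0}}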

I guess that is it for this post. Back to find-topic[op] in the next couple of posts.

new function: find-unique[op]

The idea is simple enough. Given a collection of superpositions, find the kets that are unique to each of those superpositions. Motivated by my thoughts in my last post.

Here is a short example in the console:
-- load some data:
sa: load fred-sam-friends.sw

-- see what we have:
sa: dump
----------------------------------------
|context> => |context: friends>

friends |Fred> => |Jack> + |Harry> + |Ed> + |Mary> + |Rob> + |Patrick> + |Emma> + |Charlie>

friends |Sam> => |Charlie> + |George> + |Emma> + |Jack> + |Rober> + |Frank> + |Julie>
----------------------------------------

-- find the unique kets:
sa: find-unique[friends]

-- see what we now have:
sa: dump
----------------------------------------
|context> => |context: friends>

friends |Fred> => |Jack> + |Harry> + |Ed> + |Mary> + |Rob> + |Patrick> + |Emma> + |Charlie>
unique-friends |Fred> => |Harry> + |Ed> + |Mary> + |Rob> + |Patrick>

friends |Sam> => |Charlie> + |George> + |Emma> + |Jack> + |Rober> + |Frank> + |Julie>
unique-friends |Sam> => |George> + |Rober> + |Frank> + |Julie>
----------------------------------------
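Under the hood, find-unique[op] presumably just keeps the kets that appear in exactly one of the superpositions for the given operator. A minimal sketch of that idea in plain python (dicts standing in for superpositions; this is not the project's actual code):
from collections import Counter

def find_unique(sps):
    # sps maps a label (eg "Fred") to its superposition, here a dict of ket -> coeff
    counts = Counter(ket for sp in sps.values() for ket in sp)
    # keep only the kets that occur in exactly one superposition
    return {label: {ket: coeff for ket, coeff in sp.items() if counts[ket] == 1}
            for label, sp in sps.items()}

friends = {
    "Fred": {k: 1 for k in ["Jack", "Harry", "Ed", "Mary", "Rob", "Patrick", "Emma", "Charlie"]},
    "Sam":  {k: 1 for k in ["Charlie", "George", "Emma", "Jack", "Rober", "Frank", "Julie"]},
}
print(find_unique(friends))
# Fred -> Harry, Ed, Mary, Rob, Patrick; Sam -> George, Rober, Frank, Julie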
Here is a bigger example:
sa: load names.sw
sa: find-unique[names]
sa: save full-unique-names.sw

-- let's see how big our data set is now:
sa: how-many names |female name>
|number: 4275>
sa: how-many unique-names |female name>
|number: 2914>

sa: how-many names |male name>
|number: 1219>
sa: how-many unique-names |male name>
|number: 161>

sa: how-many names |last name>
|number: 88799>
sa: how-many unique-names |last name>
|number: 86747>
So hopefully that is all clear enough. I will put it to interesting use in the next post.

Also, BTW, this code seems to be really quite fast!

Update: I should also mention find-unique[op] belongs in the category of features that can learn new knowledge from our existing knowledge with minimal work from a human. This is always a good thing! A couple of other examples are "create inverse" and "map".

Update: a couple of other uses for find-unique[op].

1) say we have data on cultural factors:
cultural-factors |Australia> => ...
cultural-factors |UK> => ...
cultural-factors |US> => ...
then:
find-unique[cultural-factors]

2) say we have dictionaries for different English dialects:
words |Australian English> => ...
words |UK English> => ...
words |US English> => ...
then:
find-unique[words]

Anyway, simple enough, but still useful!

Monday 23 March 2015

find-topic[names]

In this post we make use of normed-frequency-class, map-to-topic and find-topic[op] to give a probability of a name being male, female, or a last name. The data is from "Frequently Occurring Surnames from Census 1990", and here is our sw file of the same.

In the console:
-- load our data:
sa: load names.sw

-- let's see how big our data set is:
sa: how-many names |female name>
|number: 4275>

sa: how-many names |male name>
|number: 1219>

sa: how-many names |last name>
|number: 88799>

-- now play with the data:
sa: find-topic[names] |alice>
91.304|female name> + 8.696|last name> 

sa: find-topic[names] |bob>
61.404|male name> + 38.596|last name> 

sa: find-topic[names] |alex>
51.012|male name> + 33.401|last name> + 15.587|female name>

sa: find-topic[names] |sam>
47.818|male name> + 37.571|last name> + 14.611|female name>

-- save some typing:
h |*> #=> find-topic[names] |_self>

sa: h |smith>
100.000|last name>

sa: h |frank>
47.324|male name> + 41.831|last name> + 10.845|female name>

sa: h |bella>
53.333|last name> + 46.667|female name>

sa: h |lisa>
92.105|female name> + 7.895|last name>

sa: h |tim>
88.421|male name> + 11.579|last name>

sa: h |jane>
91.304|female name> + 8.696|last name>

sa: h |alexandria>
82.353|female name> + 17.647|last name>
Hopefully that is all clear enough. If you look at the relative probabilities in our results, it really does do quite a good job. Interestingly though, find-topic[op] works with any frequency list! Say you had frequencies of names from different countries. ie, names that are common in Italy, France, Germany, Japan, Scotland, Ireland and so on. Given such frequency lists, find-topic[op] could guess the country of origin of a sample name.

Another thing to note is that frequency lists give vastly better results than plain lists with all coeffs equal. In that case, if a ket is in n lists, the coeff for each type would be 100/n. There would be no information about whether it is more likely a female name or a last name, say.

Also, it may be interesting to process the data and find the set of all names that are in one name type only (ie, those that return a coeff of 100 for find-topic[names]). I'll give it some thought on the easiest way to do that.

More uses of find-topic[op] coming up.

Update: this is how you find names that are in only 1 of the frequency lists:
is-unique |*> #=> is-equal[100] push-float find-topic[names] |_self>

|unique male names> => such-that[is-unique] names |male name>
|unique female names> => such-that[is-unique] names |female name>
|unique last names> => such-that[is-unique] names |last name>
Hrmm... seems slow. Here is another approach:
|unique names> => drop-above[1] (clean names |male name> + clean names |female name> + clean names |last name>)

is-male-name |*> #=> do-you-know intn(|_self>,names |male name>)
|unique male names> => such-that[is-male-name] "" |unique names>

is-female-name |*> #=> do-you-know intn(|_self>,names |female name>)
|unique female names> => such-that[is-female-name] "" |unique names>

is-last-name |*> #=> do-you-know intn(|_self>,names |last name>)
|unique last names> => such-that[is-last-name] "" |unique names>
Nope! That is even slower! I think I need to write some python to fix this.
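For what it's worth, in plain python this reduces to simple set differences. A minimal sketch, assuming the three name lists are available as python sets (the variable names here are hypothetical):
# hypothetical: male_names, female_names, last_names are the three name lists as python sets
male, female, last = set(male_names), set(female_names), set(last_names)

unique_male_names = male - female - last
unique_female_names = female - male - last
unique_last_names = last - female - male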

Update: these examples represent what will probably become a common pattern:
is-something |*> #=> do-you-know mbr(|_self>,some-op |some list>)
-- where mbr is a kind of optimization of intersection.

Friday 20 March 2015

the map-to-topic and find-topic functions

The idea behind find-topic[op] is that, given a collection of frequency lists, kets whose frequency changes drastically from list to list are the content, and those with roughly invariant frequency (either very common like "the", or very rare) can be considered to carry less content. We measure the frequency of a ket using the normed-frequency-class function given in the last post.

Here is the python:
-- in the ket class:
  def find_topic(self,context,op):           
    return context.map_to_topic(self,op)

-- in the superposition class:
  def find_topic(self,context,op):           
    result = superposition()
    for x in self.data:
      result += context.map_to_topic(x,op)        # .drop_below(min) here too?
    r = result.normalize(100).coeff_sort()
    return r

-- in the new_context class:
  def map_to_topic(self,e,op,t=0):
    if type(op) == ket:
      op = op.label[4:]
    result = superposition()                                            
    for label in self.ket_rules_dict:
      if op in self.ket_rules_dict[label]:
        frequency_list = self.recall(op,label,True)                     
        value = normed_frequency_class(e,frequency_list)
        if value > t:                                                
          result.data.append(ket(label,value))    # "result += ket(label,value)" when swap in fast_superposition
    return result.normalize(100).coeff_sort()
I guess that is it. Though I suppose how the algo works, given the above python, is somewhat opaque. Examples in the next few posts.

Thursday 19 March 2015

the normed frequency class equation

In this post I will give the normed frequency class equation. I guess it could be considered a type of fuzzy set membership function. If all coeffs in a superposition X are equal, then it gives 1 if a ket is in X, and 0 if that ket is not in X. If the coeffs are not all equal then it has fuzzier properties.

Here is the python:
# e is a ket, X is a superposition
# for best effect X should be a frequency list
def normed_frequency_class(e,X):
  e = e.ket()                                  # make sure e is a ket, not a superposition, else X.find_value(e) bugs out.
  X = X.drop()                                 # drop elements with coeff <= 0
  smallest = X.find_min_coeff()                # return the min coeff in X as float
  largest = X.find_max_coeff()                 # return the max coeff in X as float
  f = X.find_value(e)                          # return the value of ket e in superposition X as float

  if largest <= 0 or f <= 0:                   # otherwise the math.log() blows up!
    return 0

  fc_max = math.floor(0.5 - math.log(smallest/largest,2)) + 1  # NB: the + 1 is important, else the smallest element in X gets reported as not in set.
  return 1 - math.floor(0.5 - math.log(f/largest,2))/fc_max
The motivation for this function is the frequency class equation given on wikipedia.
N = floor(1/2 - log_2(frequency-of-this-item/frequency-of-most-common-item))
All I have done is normalized it so 1 for best match, 0 for not in set.

I guess I should give some examples. Let's load up some knowledge in the console:
sa: load normed-frequency-class-examples.sw
sa: dump
----------------------------------------
|context> => |context: normed frequency class>

 |X> => |the> + |he> + |king> + |boy> + |outrageous> + |stringy> + |transduction> + |mouse>

 |Y> => 13|the> + 13|he> + 13|king> + 13|boy> + 13|outrageous> + 13|stringy> + 13|transduction> + 13|mouse>

 |Z> => 3789654|the> + 2098762|he> + 57897|king> + 56975|boy> + 76|outrageous> + 5|stringy> + |transduction> + |mouse>

the |*> #=> ket-nfc(|the>,""|_self>)
he |*> #=> ket-nfc(|he>,""|_self>)
king |*> #=> ket-nfc(|king>,""|_self>)
boy |*> #=> ket-nfc(|boy>,""|_self>)
outrageous |*> #=> ket-nfc(|outrageous>,""|_self>)
stringy |*> #=> ket-nfc(|stringy>,""|_self>)
transduction |*> #=> ket-nfc(|transduction>,""|_self>)
mouse |*> #=> ket-nfc(|mouse>,""|_self>)
not-in-set |*> #=> ket-nfc(|not-in-set>,""|_self>)

 |nfc table> #=> table[SP,the,he,king,boy,outrageous,stringy,transduction,mouse,not-in-set] split |X Y Z>
----------------------------------------

-- now take a look at the table:
sa: "" |nfc table>
+----+-----+----------+----------+----------+------------+----------+--------------+----------+------------+
| SP | the | he       | king     | boy      | outrageous | stringy  | transduction | mouse    | not-in-set |
+----+-----+----------+----------+----------+------------+----------+--------------+----------+------------+
| X  | nfc | nfc      | nfc      | nfc      | nfc        | nfc      | nfc          | nfc      | 0 nfc      |
| Y  | nfc | nfc      | nfc      | nfc      | nfc        | nfc      | nfc          | nfc      | 0 nfc      |
| Z  | nfc | 0.96 nfc | 0.74 nfc | 0.74 nfc | 0.30 nfc   | 0.13 nfc | 0.04 nfc     | 0.04 nfc | 0 nfc      |
+----+-----+----------+----------+----------+------------+----------+--------------+----------+------------+
And we can clearly see it has the properties promised above. |X> and |Y> give the same results even though X has all coeffs 1, and Y has all coeffs 13. "not-in-set" returned 0, since it is not in any of the three superpositions. And Z gives a nice demonstration of the fuzzy set membership idea.
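As a quick sanity check of the Z row, here is the normed frequency class calculation done directly on the coefficients of |Z>, using just python's math module (it reproduces the table values):
import math

# the coefficients in the |Z> superposition above
Z = {"the": 3789654, "he": 2098762, "king": 57897, "boy": 56975,
     "outrageous": 76, "stringy": 5, "transduction": 1, "mouse": 1}

smallest, largest = min(Z.values()), max(Z.values())
fc_max = math.floor(0.5 - math.log(smallest / largest, 2)) + 1

for word, f in Z.items():
    nfc = 1 - math.floor(0.5 - math.log(f / largest, 2)) / fc_max
    print(word, round(nfc, 2))
# the 1.0, he 0.96, king 0.74, boy 0.74, outrageous 0.3,
# stringy 0.13, transduction 0.04, mouse 0.04
# and a ket not in Z has f = 0, so the f <= 0 guard returns 0, matching the not-in-set column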

That's it for this post. We will be putting it to use in the next couple of posts.

Update: for frequency lists, log(f/largest) is the best choice. But in other cases, maybe some other foo(f/largest) would work better. I haven't given it all that much thought.

Wednesday 18 March 2015

some categorize examples

This time, a couple of categorize examples.

First, an easy one. Recall H-I pattern recognition?
Now, in the console:
sa: load H-I-pat-rec.sw
sa: categorize[pixels,0.6,result]

sa: dump |result>
category-0 |result> => |letter: H> + |noisy: H> + |noisy: H2>
category-1 |result> => |letter: I> + |noisy: I> + |noisy: I2>
category-2 |result> => |letter: O>
See, as simple as that! And it worked.

Now for a harder example.
webpage superpositions:
sa: load improved-fragment-webpages.sw
sa: categorize[hash-4B,0.7,result]

-- now look at the results:
sa: dump |result>
category-0 |result> => |abc 1> + |abc 2> + |abc 3> + |abc 4> + |abc 5> + |abc 6> + |abc 7> + |abc 8> + |abc 9> + |abc 10> + |abc 11>
category-1 |result> => |adelaidenow 1> + |adelaidenow 2> + |adelaidenow 3> + |adelaidenow 4> + |adelaidenow 5> + |adelaidenow 6> + |adelaidenow 7> + |adelaidenow 8> + |adelaidenow 9> + |adelaidenow 10> + |adelaidenow 11>
category-2 |result> => |slashdot 1> + |slashdot 2> + |slashdot 3> + |slashdot 4> + |slashdot 5> + |slashdot 6> + |slashdot 7> + |slashdot 8> + |slashdot 9> + |slashdot 10> + |slashdot 11>
category-3 |result> => |smh 1> + |smh 2> + |smh 3> + |smh 4> + |smh 5> + |smh 6> + |smh 7> + |smh 8> + |smh 9> + |smh 10> + |smh 11>
category-4 |result> => |wikipedia 1> + |wikipedia 2> + |wikipedia 3> + |wikipedia 4> + |wikipedia 5> + |wikipedia 6> + |wikipedia 7> + |wikipedia 8> + |wikipedia 9> + |wikipedia 10> + |wikipedia 11>
category-5 |result> => |youtube 1> + |youtube 2> + |youtube 3> + |youtube 4> + |youtube 5> + |youtube 6> + |youtube 7> + |youtube 8> + |youtube 9> + |youtube 10> + |youtube 11>

-- now pretty print those results:
sa: websites |*> #=> apply(|_self>,|result>)
sa: table[category,websites] supported-ops |result>
+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| category   | websites                                                                                                                                                              |
+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| category-0 | abc 1, abc 2, abc 3, abc 4, abc 5, abc 6, abc 7, abc 8, abc 9, abc 10, abc 11                                                                                         |
| category-1 | adelaidenow 1, adelaidenow 2, adelaidenow 3, adelaidenow 4, adelaidenow 5, adelaidenow 6, adelaidenow 7, adelaidenow 8, adelaidenow 9, adelaidenow 10, adelaidenow 11 |
| category-2 | slashdot 1, slashdot 2, slashdot 3, slashdot 4, slashdot 5, slashdot 6, slashdot 7, slashdot 8, slashdot 9, slashdot 10, slashdot 11                                  |
| category-3 | smh 1, smh 2, smh 3, smh 4, smh 5, smh 6, smh 7, smh 8, smh 9, smh 10, smh 11                                                                                         |
| category-4 | wikipedia 1, wikipedia 2, wikipedia 3, wikipedia 4, wikipedia 5, wikipedia 6, wikipedia 7, wikipedia 8, wikipedia 9, wikipedia 10, wikipedia 11                       |
| category-5 | youtube 1, youtube 2, youtube 3, youtube 4, youtube 5, youtube 6, youtube 7, youtube 8, youtube 9, youtube 10, youtube 11                                             |
+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
And again, simple, and it worked! Though this example did take 7 minutes. And being, I think, O(n^3), it becomes impractical for large data sets :(.

Heh, also. The BKO to make the table is somewhat opaque, even to me!

And that is it for categorize. On to normed frequency class for the next few posts.

Update: the best value for t in categorize (0.6 and 0.7 above) depends strongly on your data set. I usually make a similarity matrix first, and then use that to choose the value. Maybe there is a neater way? Consider the adult wage prediction task from the other day. In that case you would need t = 0.99998 or something to give good results!

a BKO version of categorize

Now, let's give the python for the BKO version of categorize. I'll give an example usage in the next post.

Here is the python:
# one is a superposition
# op is a string
# x is a ket
# thresh is a float
def simm_mbr(context,op,x,thresh,one):
  f = x.apply_op(context,op)
  for elt in one:
    g = elt.apply_op(context,op)
    if silent_simm(f,g) >= thresh:
      return True
  return False
   
# categorize[op,thresh,destination]
def categorize(context,parameters):
  try:
    op,thresh,destination = parameters.split(',')
    thresh = float(thresh)
    destination = ket(destination)
  except:
    return ket("",0)
  
  one = context.relevant_kets(op)                 # one is a superposition
  print("one:",one)  
  out_list = []                                   # out_list will be a list of superpositions.
  for x in one:                                   # x is of course a ket
    n = 0
    del_list = []                                 # del_list will be a list of integers.
    for i in range(len(out_list)):
      if simm_mbr(context,op,x,thresh,out_list[i]):
        if n == 0:
          out_list[i] += x
          idx = i
          n = 1
        else:
          out_list[idx] += out_list[i]
          del_list.append(i)
    if n == 0:
      out_list.append(superposition() + x)        # we use "superposition() + x" instead of just "x" so out_list is always a list of superpositions, not kets.
    else:
      out_list = [x for index,x in enumerate(out_list) if index not in del_list]

  for k, sp in enumerate(out_list):
    print("sp:",sp)
    context.learn("category-" + str(k),destination,sp)  
  return ket("categorize")
And that's it!

an implementation of categorize code

An implementation of the categorize code.

Here is the python, presumably it is correct, I haven't tested it recently:
def metric_mbr(metric,x,thresh,data):
  for elt in data:
    if metric(x,elt) <= thresh:      # depending on whether 0 or 1 means an exact match, you may want to swap to: >= thresh
      return True
  return False

def categorize_list(data,metric,thresh):
  out_list = []
  for x in data:
    n = 0
    del_list = []
    for i in range(len(out_list)):
      if metric_mbr(metric,x,thresh,out_list[i]):
        if n == 0:
          out_list[i].append(x)
          idx = i
          n = 1
        else:
          out_list[idx] += out_list[i]
          del_list.append(i)
    if n == 0:
      out_list.append([x])
    else:
      out_list = [x for index, x in enumerate(out_list) if index not in del_list]
  return out_list
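A quick toy usage example (my own toy metric, not anything from the project): with absolute difference as the metric, 0 means an exact match, so the <= thresh test is the right way around:
# two obvious clusters plus an outlier
data = [1, 2, 3, 10, 11, 12, 50]
print(categorize_list(data, lambda x, y: abs(x - y), 1.5))
# [[1, 2, 3], [10, 11, 12], [50]]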
And that's it. BTW, yeah O(n^3) I think!

introducing bridging sets

This time a simple but useful maths idea. Basically it is the motivation for my categorize code, which I will give in the next couple of posts.

Simply enough, let's semi-formally define them:
First, we say x is near y if metric[x,y] <= t for a metric of your choice, and some threshold t.

Then a linear bridging set is a set {x0,x1,x2,x3,...,xn} such that:
1) x_k is near x_{k+1}, for all k in {0,1,...,n-1}
2) x_0 is not near x_n

A general bridging set is a set {x0,x1,x2,x3,...,xn} such that:
-- every element in the set is near some other element in the set:
1) for every j in {0,1,...,n}, x_j is near an x_k for some k != j in {0,1,...,n}. 
-- some elements may not be near other elements in the bridging set:
2) there may exist j,k pairs such that x_j is not near x_k
Here is a simple visualization of that (figure omitted), where A is a linear bridging set, and B and C are general bridging sets.

Some examples of linear bridging sets:
- The easiest is a bridge. Set the left bank to be x_0, the right bank to be x_n, and the steps you take from one side to the other form a linear bridging set (and hence the name "bridging set").

-  species DNA: I don't recall the exact species (it is an example of what is called a ring species), but it had a strange property. The members at the north of their range could not breed with those at the south of their range. But, interestingly, at any point in between they could breed with their neighbours. At least that is the way I remember it.

- Your appearance, first as a baby, then up through adulthood, and then old age forms a linear bridging set.

- Your weekly shopping basket is usually a linear bridging set (if you are a consistent shopper). Eg, from 5 years ago, week by week, till now.

- There are other trivial examples of linear bridging sets:
 - {a,b,c,d,e,f,...,x,y,z}
 - {0,1,2,3,4,5,6,...,20,21,22,23,24,25}
 - Water slowly brought to the boil.
 - indeed, anything that slowly evolves with time is a linear bridging set.

Some examples of general bridging sets:
- The 400m running track on an oval is a simple general bridging set; so is the path you take for your morning jog.

- If we have a metric that can measure the similarity in DNA (some version of simm perhaps), then each species forms a distinct bridging set. ie, all members of a species are quite similar in their DNA, and dissimilar to members of other species.

- The tree of life, ie, the evolution of life from single-celled to multi-cellular organisms, is a big bridging set. A smaller version of this is that you are in a bridging set with your parents, grandparents, and so on back through your ancestors. And then, via your parents, you are in a bridging set with your siblings, their children, their children's children and so on.

- A person's face from all different angles/perspectives forms a bridging set. Ditto a banana, a pen, a stapler, a tiger, an elephant, any object really.

- The collection of atoms that make up, say, a dog forms a general bridging set.

- The scenes in a movie/tv show, even as characters move around a bit, form a general bridging set.

Anyway, that is all simple enough!

The point? It motivates our categorize code. Categorize basically converts a set of elements into distinct/disjoint bridging sets. The python for that up next.

Some notes:
The value of t has a strong influence on the result.
Set it too tight and your categories splinter into smaller ones.
Set it too loose, and everything ends up in the same category.

The addition of a single new element can sometimes merge two or more categories into one, if it is in a key location.
And the other way too. The removal of a key element can fracture a category into two or more pieces (eg, if you remove the middle of the bridge, it is no longer a single bridging set).
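As a toy demonstration of that merging behaviour, using the categorize_list code given earlier on this page (absolute difference as the metric, hypothetical numbers):
metric = lambda x, y: abs(x - y)

# two separate categories:
print(categorize_list([1, 2, 3, 10, 11, 12], metric, 4))
# [[1, 2, 3], [10, 11, 12]]

# add the single "key location" element 7, and the two categories merge into one:
print(categorize_list([1, 2, 3, 10, 11, 12, 7], metric, 4))
# [[1, 2, 3, 7, 10, 11, 12]]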

Update: unlike support-vector-machines, bridging sets/categorize do not require the distinct/disjoint sets to be separable by a hyper-plane. eg, in an extreme example you could have a bulls-eye, a circle with a dot in the middle, and the code would classify them as two different bridging sets.

Update: in general it does not make much sense to consider the average of the elements in a bridging set. Depending on the exact bridging set, the average can be a long way from any member of the given bridging set.

using active buffer

Last post I gave the code for active buffer. This time a bunch of examples.

Start with an easy one. Given only part of a face, recognize it as a face.
-- define the features of a face:
sa: |body part: face> => 2|eyes> + |nose> + 2|ears> + 2|lips> + |hair>

-- even if we only have a couple of pieces of the face, we can still recognize it as a face
sa: active-buffer[7,0] (2|eyes> + |nose>)
0.750|body part: face>

-- again, hair, nose and lips is enough to say we have a face:
sa: active-buffer[7,0] (|hair> + |nose> + |lips>)
1.625|body part: face>

-- this time we see all parts of the face:
sa: active-buffer[7,0] (2|eyes> + |nose> + 2|ears> + 2|lips> + |hair>)
7.125|body part: face>
Given some knowledge of internet acronyms, interpret a sentence full of them:
-- load some data:
sa: load internet-acronyms.sw

-- read a short sentence:
sa: read |text: fwiw I think it is all fud imho, lol. thx.>
|word: fwiw> + |word: i> + |word: think> + |word: it> + |word: is> + |word: all> + |word: fud> + |word: imho> + |word: lol> + |word: thx>

-- apply active buffer to this:
sa: active-buffer[7,0] read |text: fwiw I think it is all fud imho, lol. thx.>
2.593|phrase: For What It's Worth> + 6.038|phrase: Fear, Uncertainty, Doubt> + 5.279|phrase: In My Humble Opinion> + 4.186|phrase: Laughing Out Loud> + 2.593|phrase: Thanks>
The breakfast menu example:
sa: load next-breakfast-menu.sw 
sa: description |food: Homestyle Breakfast>
|text: "Two eggs, bacon or sausage, toast, and our ever-popular hash browns">

-- read the description for Homestyle Breakfast
sa: read description |food: Homestyle Breakfast>
|word: two> + |word: eggs> + |word: bacon> + |word: or> + |word: sausage> + |word: toast> + |word: and> + |word: our> + |word: ever-popular> + |word: hash> + |word: browns>

-- apply active-buffer to this:
sa: active-buffer[7,1] read description |food: Homestyle Breakfast>
|number: 2> + |food: eggs> + |food: bacon> + |food: sausage> + |food: toast> + |food: hash browns>
The Fred/Mary/pregnancy example:
-- first load up some knowledge:
sa: |person: Fred Smith> => |word: fred> + |word: freddie> + |word: simth> + |word: smithie>  -- various names and nick-names
sa: |person: Mary> => |word: mazza>                                                           -- just a nick-name
sa: |greeting: Hey!> => |word: hey>
sa: |question: what is> => |word: what's>
sa: |direction: up> => |word: up>
sa: |phrase: having a baby> => read |text: having a baby>
sa: |phrase: in the family way> => read |text: in the family way>
sa: |phrase: up the duff> => read |text: up the duff>
sa: |phrase: with child> => read |text: with child>
sa: |concept: pregnancy> => |phrase: having a baby> + |phrase: in the family way> + |phrase: up the duff> + |phrase: with child>

-- now start playing with it:
sa: active-buffer[7,0] read |text: Hey Freddie what's up?>  
2.083|greeting: Hey!> + 1.500|person: Fred Smith> + 2.917|question: what is> + 2.083|direction: up> + 1.250|phrase: up the duff>
-- up the duff is in there because of the word "up"

-- now test phrase matching a concept, in this case phrases that mean pregnant.
sa: active-buffer[7,0] read |text: Hey Mazza, you with child, up the duff, in the family way, having a baby?>
2.593|greeting: Hey!> + 4.186|person: Mary> + 11.586|phrase: with child> + 6.857|direction: up> + 23.414|phrase: up the duff> + 25.000|phrase: in the family way> + 9.224|phrase: having a baby>

-- one more layer of active-buffer:
sa: active-buffer[7,0] active-buffer[7,0] read |text: Hey Mazza, you with child, up the duff, in the family way, having a baby?>
11.069|concept: pregnancy>
Some notes:
1) this code goes some way towards a computer understanding what it is "reading", if you can say a computer can read.

introducing active buffer

Here is yet another use of our simm. The idea is that there seem to be various layers in the brain that have an active buffer of recent kets, and once the buffer is full, it is processed. When reading letters, that buffer fills up with individual letters, and then when a non-letter char appears, the brain processes the buffer to decide what word it has just seen. The next level up is sentence fragments: the buffer fills with words until it sees punctuation (eg, a comma or a full stop). And so on. Another example is a conversation with a person. You build up a buffer of their words/sentences, and then when they have made their point, you process that buffer.

Here is the usage:
-- uses "" as the default pattern.
active-buffer[N,t] some-superposition             

-- uses your chosen pattern (we can't use "" as the pattern, due to broken parser!)
active-buffer[N,t,pattern] some-superposition     

where: 
N is an int              -- the size of the active buffer  
t is a float             -- the drop below threshold
pattern is a string      -- the pattern we are using
Here is the python:
def console_active_buffer(one,context,parameters):  # one is the passed in superposition
  try:
    N,t,pattern = parameters.split(',')
    N = int(N)
    t = float(t)
  except:
    try:
      N,t = parameters.split(',')
      N = int(N)
      t = float(t)
      pattern = ""
    except:
      return ket("",0)
  
  one = superposition() + one                      # make sure one is a superposition, not a ket.    
  result = superposition()  
  data = one.data           
  for k in range(len(data)):
    for n in range(N):      
      if k < len(data) - n:
        y = superposition()
        y.data = data[k:k+n+1]                      # this is the bit you could call the buffer.        
        result += context.pattern_recognition(y,pattern).drop_below(t)
  return result        
Yeah, the code is a bit ugly! I'm not currently sure of the best way to tidy it. Probably when I have working sequences.
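To make the double loop a little less opaque, here is the set of buffers it generates for a small word list (a plain-python illustration of my own, with N = 3):
data = ["hey", "freddie", "what's", "up"]
N = 3
buffers = [data[k:k + n + 1] for k in range(len(data)) for n in range(N) if k < len(data) - n]
# [['hey'], ['hey', 'freddie'], ['hey', 'freddie', "what's"],
#  ['freddie'], ['freddie', "what's"], ['freddie', "what's", 'up'],
#  ["what's"], ["what's", 'up'], ['up']]
Each of those buffers is then run through pattern_recognition, and anything above the drop-below threshold is summed into the result.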

Some notes:
1) this is implemented using superpositions, but really it would work better with sequences. Unfortunately I haven't fully worked out how sequences will work, let alone written any code.
2) The code above uses fixed sized buffers (N). It seems likely in the brain there are end-buffer kets that signal the end of the buffer and trigger processing of the buffer, rather than a fixed size. eg, for letters non-word chars trigger processing of a word.
3) the general supervised pattern recognition algo somewhat decreases the use case for this code.

Examples in the next post!

Tuesday 17 March 2015

spike fourier transform using simm

This post is just a quick one. We can use similar[op] to do an approximation of a Fourier transform.

Here is the BKO:
spikes |wave-1> => range(|0>,|1000>,|1>)
spikes |wave-2> => range(|0>,|1000>,|2>)
spikes |wave-3> => range(|0>,|1000>,|3>)
spikes |wave-4> => range(|0>,|1000>,|4>)
spikes |wave-5> => range(|0>,|1000>,|5>)
spikes |wave-6> => range(|0>,|1000>,|6>)
spikes |wave-7> => range(|0>,|1000>,|7>)
spikes |wave-8> => range(|0>,|1000>,|8>)
spikes |wave-9> => range(|0>,|1000>,|9>)
spikes |wave-10> => range(|0>,|1000>,|10>)
spikes |wave-11> => range(|0>,|1000>,|11>)
spikes |wave-12> => range(|0>,|1000>,|12>)
spikes |wave-13> => range(|0>,|1000>,|13>)
spikes |wave-14> => range(|0>,|1000>,|14>)
spikes |wave-15> => range(|0>,|1000>,|15>)
Now, load that into the console:
sa: load make-ft-spikes.sw
sa: simm |*> #=> 100 self-similar[spikes] |_self>
sa: map[simm,similarity] rel-kets[spikes] |>

sa: matrix[similarity]
[ wave-1  ] = [  100.0  50.05  33.37  25.07  20.08  16.68  14.29  12.59  11.19  10.09  9.09   8.39   7.69   7.19   6.69   ] [ wave-1  ]
[ wave-2  ]   [  50.05  100.0  33.33  50.1   20.16  33.33  14.37  25.15  11.18  20.16  9.18   16.77  7.78   14.37  6.79   ] [ wave-2  ]
[ wave-3  ]   [  33.37  33.33  100.0  25.15  20.06  50.0   14.37  12.57  33.53  10.18  9.28   25.15  7.78   7.19   20.06  ] [ wave-3  ]
[ wave-4  ]   [  25.07  50.1   25.15  100.0  20.32  33.47  14.34  50.2   11.16  20.32  9.16   33.47  7.97   14.34  6.77   ] [ wave-4  ]
[ wave-5  ]   [  20.08  20.16  20.06  20.32  100.0  16.92  14.43  12.94  11.44  50.25  9.45   8.46   7.96   7.46   33.33  ] [ wave-5  ]
[ wave-6  ]   [  16.68  33.33  50.0   33.47  16.92  100.0  14.37  25.15  33.53  20.36  9.58   50.3   7.78   14.37  20.36  ] [ wave-6  ]
[ wave-7  ]   [  14.29  14.37  14.37  14.34  14.43  14.37  100.0  12.59  11.19  10.49  9.09   8.39   7.69   50.35  6.99   ] [ wave-7  ]
[ wave-8  ]   [  12.59  25.15  12.57  50.2   12.94  25.15  12.59  100.0  11.11  20.63  9.52   33.33  7.94   14.29  7.14   ] [ wave-8  ]
[ wave-9  ]   [  11.19  11.18  33.53  11.16  11.44  33.53  11.19  11.11  100.0  10.71  9.82   25.0   8.04   7.14   20.54  ] [ wave-9  ]
[ wave-10 ]   [  10.09  20.16  10.18  20.32  50.25  20.36  10.49  20.63  10.71  100.0  9.9    16.83  7.92   14.85  33.66  ] [ wave-10 ]
[ wave-11 ]   [  9.09   9.18   9.28   9.16   9.45   9.58   9.09   9.52   9.82   9.9    100.0  8.79   7.69   7.69   7.69   ] [ wave-11 ]
[ wave-12 ]   [  8.39   16.77  25.15  33.47  8.46   50.3   8.39   33.33  25.0   16.83  8.79   100.0  8.33   14.29  20.24  ] [ wave-12 ]
[ wave-13 ]   [  7.69   7.78   7.78   7.97   7.96   7.78   7.69   7.94   8.04   7.92   7.69   8.33   100.0  7.79   7.79   ] [ wave-13 ]
[ wave-14 ]   [  7.19   14.37  7.19   14.34  7.46   14.37  50.35  14.29  7.14   14.85  7.69   14.29  7.79   100.0  6.94   ] [ wave-14 ]
[ wave-15 ]   [  6.69   6.79   20.06  6.77   33.33  20.36  6.99   7.14   20.54  33.66  7.69   20.24  7.79   6.94   100.0  ] [ wave-15 ]
I guess that is it. I leave it up to the reader to interpret the matrix results, but the main point is that if a wave has frequency f, then it is somewhat similar to a wave with frequency k*f, for k an integer. And by "wave" I mean a very spiky wave.

Actually, we can visualize this with a couple of examples:
sa: table[wave,coeff] coeff-sort similarity |wave-2>
+---------+--------+
| wave    | coeff  |
+---------+--------+
| wave-2  | 100.0  |
| wave-4  | 50.1   |
| wave-1  | 50.05  |
| wave-3  | 33.333 |
| wave-6  | 33.333 |
| wave-8  | 25.15  |
| wave-5  | 20.16  |
| wave-10 | 20.16  |
| wave-12 | 16.766 |
| wave-7  | 14.371 |
| wave-14 | 14.371 |
| wave-9  | 11.178 |
| wave-11 | 9.182  |
| wave-13 | 7.784  |
| wave-15 | 6.786  |
+---------+--------+

sa: table[wave,coeff] coeff-sort similarity |wave-3>
+---------+--------+
| wave    | coeff  |
+---------+--------+
| wave-3  | 100.0  |
| wave-6  | 50.0   |
| wave-9  | 33.533 |
| wave-1  | 33.367 |
| wave-2  | 33.333 |
| wave-4  | 25.15  |
| wave-12 | 25.15  |
| wave-5  | 20.06  |
| wave-15 | 20.06  |
| wave-7  | 14.371 |
| wave-8  | 12.575 |
| wave-10 | 10.18  |
| wave-11 | 9.281  |
| wave-13 | 7.784  |
| wave-14 | 7.186  |
+---------+--------+

sa: table[wave,coeff] coeff-sort similarity |wave-4>
+---------+--------+
| wave    | coeff  |
+---------+--------+
| wave-4  | 100.0  |
| wave-8  | 50.199 |
| wave-2  | 50.1   |
| wave-6  | 33.466 |
| wave-12 | 33.466 |
| wave-3  | 25.15  |
| wave-1  | 25.075 |
| wave-5  | 20.319 |
| wave-10 | 20.319 |
| wave-7  | 14.343 |
| wave-14 | 14.343 |
| wave-9  | 11.155 |
| wave-11 | 9.163  |
| wave-13 | 7.968  |
| wave-15 | 6.773  |
+---------+--------+

sa: table[wave,coeff] coeff-sort similarity |wave-5>
+---------+--------+
| wave    | coeff  |
+---------+--------+
| wave-5  | 100.0  |
| wave-10 | 50.249 |
| wave-15 | 33.333 |
| wave-4  | 20.319 |
| wave-2  | 20.16  |
| wave-1  | 20.08  |
| wave-3  | 20.06  |
| wave-6  | 16.915 |
| wave-7  | 14.428 |
| wave-8  | 12.935 |
| wave-9  | 11.443 |
| wave-11 | 9.453  |
| wave-12 | 8.458  |
| wave-13 | 7.96   |
| wave-14 | 7.463  |
+---------+--------+
I guess that should make the results a little clearer.
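Incidentally, since all the spike coefficients here are 1, simm reduces (I believe) to the size of the intersection divided by the size of the larger spike set, which makes the k*f pattern easy to check directly (a plain-python sketch, not the project's simm code):
def spikes(step):
    # mirrors range(|0>,|1000>,|step>), which is inclusive of 1000
    return set(range(0, 1001, step))

def simm_equal_coeffs(A, B):
    # for {0,1}-coefficient superpositions: |A intersect B| / max(|A|, |B|), scaled to 100
    return 100 * len(A & B) / max(len(A), len(B))

print(round(simm_equal_coeffs(spikes(2), spikes(4)), 2))   # 50.1,  cf. the wave-2/wave-4 entry
print(round(simm_equal_coeffs(spikes(2), spikes(3)), 2))   # 33.33, cf. the wave-2/wave-3 entry
print(round(simm_equal_coeffs(spikes(3), spikes(5)), 2))   # 20.06, cf. the wave-3/wave-5 entry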

Update: I wonder if the brain makes use of something like this?

Update: likewise, we should be able to use simm to measure distances.

document types to superpositions

So, I was wondering if we could identify document types just by looking at the frequency of bytes in the document. This post tests that idea.

First, some python:
import sys

from the_semantic_db_code import *
from the_semantic_db_functions import *
from the_semantic_db_processor import *

C = context_list("files to superpositions")

# empty superposition:
empty = show_range(ket("byte: 0"),ket("byte: 255"),ket("1")).multiply(0)

def file_to_sp(filename):
  r = fast_superposition()
  with open(filename,'rb') as f:
    for line in f:
      for c in line:
        byte = str(c)
        r += ket("byte: " + byte)
  return (r.superposition().normalize(100) + empty).ket_sort()

files = [["binary","binary.exe"],["ebook","ebook.txt"],["encrypted","encrypted.gpg"],["website","website.html"],["zipped","zipped.zip"]]

for name, file in files:
  print("name:",name)
  result = file_to_sp("example-files/" + file)

  C.learn("bytes",name,result)
  for x in result:          # spit out the results so we can graph them
    print(x.value)
  print("====================")

# save the results:
sw_file = "sw-examples/files-to-bytes.sw"
save_sw(C,sw_file)
Here is the resulting sw file.

Here are the graphs of the byte-frequency distributions for the five files (images omitted).

Now, let's generate the similarity matrix:
sa: load files-to-bytes.sw
sa: simm |*> #=> 100 self-similar[bytes] |_self>
sa: map[simm,similarity] rel-kets[bytes] |>

sa: matrix[similarity]
[ binary    ] = [  100.0  12.44  47.86  16.07  47.64  ] [ binary    ]
[ ebook     ]   [  12.44  100.0  14.63  72.7   14.78  ] [ ebook     ]
[ encrypted ]   [  47.86  14.63  100.0  21.97  97.6   ] [ encrypted ]
[ website   ]   [  16.07  72.7   21.97  100.0  22.25  ] [ website   ]
[ zipped    ]   [  47.64  14.78  97.6   22.25  100.0  ] [ zipped    ]
So yeah. You can largely differentiate between document types (at least for this small sample) based just on byte counts. Cool.

I guess that is it for this post. As usual, heaps more to come!

Update: I wonder how small a file can get and still have a distinctive document type? Or, going further, is it possible to write an "English-grep" that only prints out lines that look like English? Of course, for that to work you need more than just letter counts. You need to count letter pairs, and probably triples, or use some other method of mapping a line to a superposition.
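As a rough sketch of that last idea (my own toy mapping, not project code): map each line to a superposition of adjacent letter pairs, which an "English-grep" could then compare against an average-English superposition using simm:
from collections import Counter

def line_to_letter_pair_sp(line):
    # a toy line => superposition: counts of adjacent character pairs
    line = line.lower()
    return Counter(a + b for a, b in zip(line, line[1:]))

print(line_to_letter_pair_sp("the cat sat on the mat"))
# Counter({'at': 3, 'th': 2, 'he': 2, ...})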

Monday 16 March 2015

some wage prediction results

OK. Using the resulting sw file from last time, let's check out how well our proposed algo works. I decided to do a run with a random sample of 1000 test cases.

Here is the BKO, but I have since written a tidier version:
sa: load adult-wage-pattern-recognition.sw
sa: norm |above-50K> => .000127534753220 |_self>
sa: norm |below-50K> => .000040453074433 |_self>
sa: norm-h4 |*> #=> normalize[100] coeff-sort norm M select[1,5] similar[input-pattern,pattern] |_self>
sa: equal? |*> #=> equal(norm-h4|_self>,100 answer|_self>)
sa: is-equal? |*> #=> if(is-greater-equal-than[0.5] push-float equal? |_self>,equal? |_self>,|False>)
sa: table[input,norm-h4,answer,is-equal?] pick[1000] rel-kets[input-pattern] |>
Now, due to recalculating norm-h4 (and hence similar[op]) a bunch of times, this code was rather slow! It took over a day to finish. So for the next attempt, which I'm going to run against the full test suite, I'm going to do some precomputation:
-- keeping only 100 of the best matching patterns should be sufficient.
simm |*> #=> select[1,100] similar[input-pattern,pattern] |_self>
map[simm,similarity-result] rel-kets[input-pattern] |>
Now, what results did I get from the random sample of 1000 test cases? 761 correct! ie, 76.1%. I'd like mid-eighties, but that is still not too bad. Unfortunately, I can't think of any easy tweak to improve the results. Besides, when I looked at the similarity results, it was apparent that superpositions with only 14 elements just don't have enough data points for a data set this large.

The full result table is here.
-- a sample of that table:
+---------------+----------------------------------+-----------+-----------+
| input         | norm-h4                          | answer    | is-equal? |
+---------------+----------------------------------+-----------+-----------+
| example-5707  | 55.92 below-50K, 44.08 above-50K | below-50K | 0.56 True |
| example-9629  | 67.76 above-50K, 32.24 below-50K | below-50K | False     |
| example-14039 | 82.55 above-50K, 17.45 below-50K | below-50K | False     |
| example-731   | 67.76 above-50K, 32.24 below-50K | below-50K | False     |
| example-5362  | 100.00 below-50K                 | below-50K | 1.00 True |
| example-8     | 67.76 above-50K, 32.24 below-50K | above-50K | 0.68 True |
| example-5354  | 100 below-50K                    | below-50K | True      |
| example-9825  | 67.76 above-50K, 32.24 below-50K | below-50K | False     |
| example-14761 | 100 below-50K                    | below-50K | True      |
| example-14586 | 55.92 below-50K, 44.08 above-50K | below-50K | 0.56 True |
| example-8934  | 100 below-50K                    | below-50K | True      |
| example-11743 | 100 below-50K                    | below-50K | True      |
| example-10676 | 100 above-50K                    | above-50K | True      |
| example-9861  | 67.76 above-50K, 32.24 below-50K | below-50K | False     |
| example-7228  | 55.92 below-50K, 44.08 above-50K | below-50K | 0.56 True |
| example-8039  | 82.54 above-50K, 17.46 below-50K | below-50K | False     |
| example-11380 | 67.76 above-50K, 32.24 below-50K | below-50K | False     |
| example-7476  | 55.92 below-50K, 44.08 above-50K | below-50K | 0.56 True |
| example-10797 | 82.55 above-50K, 17.45 below-50K | above-50K | 0.83 True |
| example-13409 | 82.54 above-50K, 17.46 below-50K | above-50K | 0.83 True |
| example-14147 | 100 above-50K                    | above-50K | True      |
| example-4809  | 100 below-50K                    | below-50K | True      |
| example-5080  | 67.76 above-50K, 32.24 below-50K | below-50K | False     |
...
Some notes:
1) Here is an example similarity result. Note that large numbers of superpositions are essentially identical, making it hard for my particular algo. Recall that my webpage example had superpositions with around 2000 distinct kets, compared to just 14 here.
sa: table[input,coeff,M] select[1,15] 100 similar[input-pattern,pattern] |example-9861>
+------------+--------+-----------+
| input      | coeff  | M         |
+------------+--------+-----------+
| node-16657 | 99.998 | above-50K |
| node-24189 | 99.998 | above-50K |
| node-18407 | 99.998 | below-50K |
| node-9655  | 99.997 | below-50K |
| node-23605 | 99.997 | below-50K |
| node-821   | 99.997 | below-50K |
| node-18534 | 99.997 | below-50K |
| node-12712 | 99.997 | above-50K |
| node-4948  | 99.997 | above-50K |
| node-7466  | 99.997 | above-50K |
| node-10614 | 99.997 | above-50K |
| node-16614 | 99.997 | below-50K |
| node-4179  | 99.997 | below-50K |
| node-15637 | 99.997 | above-50K |
| node-12012 | 99.997 | below-50K |
+------------+--------+-----------+
Point is, this data set fails this pattern recognition requirement (since above-50K and below-50K are not easily distinguishable):
  3) distinctive means different object types have easily distinguishable superpositions
2) In trying to find a good mapping function h, on first attempt I used:
h |*> #=> coeff-sort M similar[input-pattern,pattern] |_self>
and got terrible results! Everything was classified as "below-50K". The fix was to take into account the relative numbers of below-50K (which was the vast majority) and above-50K. I had a look at the training data set:
$ grep -c "<=50K" adult.data
24720

$ grep -c ">50K" adult.data
7841
And hence my norm term, which is 1/label-count:
 norm |above-50K> => .000127534753220 |_self>
 norm |below-50K> => .000040453074433 |_self>
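As a quick arithmetic check, those two coefficients are just 1/7841 and 1/24720:
print(1 / 7841)    # 0.00012753475...  -> norm |above-50K>
print(1 / 24720)   # 0.00004045307...  -> norm |below-50K>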
But that was still not enough. I tried a drop-below[t] but that too failed miserably. Eventually, I found select[1,n] was the way to go, though the success rate was pretty stubbornly stuck around 75% for n in {5,10,15}. Point is, so far this is the h that works best:
sa: norm-h4 |*> #=> normalize[100] coeff-sort norm M select[1,5] similar[input-pattern,pattern] |_self>
The select[1,5] essentially means we are just taking an average of only the top 5 best matches.
3) Long term, the speed of similar[op] on large data-sets is not an issue. It will be trivial to write parallel versions.

That's it for this post. I think I will move on to other things in my next few posts. I'm reasonably happy with my proof of concepts for my supervised pattern recognition algo.

harder supervised pattern recognition example

This time a harder example. I decided on this one: trying to predict, given some statistics about a person, whether they earn above or below 50K.

Here is the python:
import sys

from the_semantic_db_code import *
from the_semantic_db_functions import *
from the_semantic_db_processor import *

C = context_list("adult wage pattern recognition")

short_sample = "data/adult/30-sample.data"
training_data = "data/adult/adult.data"
test_data = "data/adult/adult.test"

def learn_data(C,filename,training=True):
  k = 0
  with open(filename,'r') as f:
    for line in f:
      try:
        age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,wage_class = line.strip().split(', ')
        k += 1
        r = ket("age",age) + ket(workclass) + ket("fnlwgt",fnlwgt) + ket(education) + ket("education-num",education_num) + ket(marital_status) + ket(occupation)
        r += ket(relationship) + ket(race) + ket(sex) + ket("capital-gain",capital_gain) + ket("capital-loss",capital_loss) + ket("hours-per-week",hours_per_week) + ket(native_country)

        # heh. adult.data uses "<=50K" and ">50K", while adult.test uses "<=50K." and ">50K."
        # tweak to fix that:
        wage_class = wage_class.rstrip('.')
        if wage_class == "<=50K":
          ket_wage_class = ket("below-50K")
        elif wage_class == ">50K":
          ket_wage_class = ket("above-50K")
        else:
          ket_wage_class = ket("")

        if training:                         # learn training data set:
          node = ket("node-" + str(k))
          C.learn("pattern",node,r)
          C.learn("M",node,ket_wage_class)
        else:
          node = ket("example-" + str(k))    # learn test cases:
          C.learn("input-pattern",node,r)
          C.learn("answer",node,ket_wage_class)
      except:
        continue

# test cases to check my code is correct.
#learn_data(C,short_sample,True)
#learn_data(C,short_sample,False)

# the main event:
learn_data(C,training_data,True)
learn_data(C,test_data,False)

# save the results:
sw_file = "sw-examples/adult-wage-pattern-recognition.sw"
save_sw(C,sw_file,False)
And that is about it. Here is the resulting sw file (15 MB, 4 op types and 97,685 learn rules). Results in the next post.

Thursday 12 March 2015

fixed supervised learning of iris classes

Last time I gave an example of supervised learning of iris classes. Funnily enough my code was buggy, and yet still gave good results! This post, the results with the fixed code.

The problem was my definition of the superpositions. I used:
r = ket("sepal-length: " +  sepal_len) + ket("sepal-width: " + sepal_width) + ket("petal-length: " + petal_len) + ket("petal-width: " + petal_width)
this means the similarity metric has no access to the float values. Either the ket labels are identical, or they are not. I'm surprised it worked at all! This is the fix:
r = ket("sepal-length",sepal_len) + ket("sepal-width",sepal_width) + ket("petal-length",petal_len) + ket("petal-width",petal_width)
And now, let's re run the data:
sa: load improved-iris-pattern-recognition.sw
sa: h2 |*> #=> coeff-sort M similar[input-pattern,pattern] |_self>
sa: discrimination2 |*> #=> round[3] push-float discrim h2 |_self>
sa: table[input,h2,discrimination2] split |node-41 node-42 node-43 node-44 node-45 node-46 node-47 node-48 node-49 node-50>
+---------+----------------------------------------------------------------+-----------------+
| input   | h2                                                             | discrimination2 |
+---------+----------------------------------------------------------------+-----------------+
| node-41 | 38.87 Iris-setosa, 30.68 Iris-versicolor, 28.65 Iris-virginica | 8.198           |
| node-42 | 37.31 Iris-setosa, 31.96 Iris-versicolor, 29.93 Iris-virginica | 5.349           |
| node-43 | 38.95 Iris-setosa, 30.93 Iris-versicolor, 28.90 Iris-virginica | 8.021           |
| node-44 | 38.20 Iris-setosa, 32.56 Iris-versicolor, 30.54 Iris-virginica | 5.632           |
| node-45 | 38.12 Iris-setosa, 32.55 Iris-versicolor, 30.53 Iris-virginica | 5.563           |
| node-46 | 38.77 Iris-setosa, 31.50 Iris-versicolor, 29.47 Iris-virginica | 7.271           |
| node-47 | 38.81 Iris-setosa, 31.07 Iris-versicolor, 29.04 Iris-virginica | 7.745           |
| node-48 | 39.09 Iris-setosa, 31.15 Iris-versicolor, 29.12 Iris-virginica | 7.941           |
| node-49 | 39.07 Iris-setosa, 30.69 Iris-versicolor, 28.67 Iris-virginica | 8.375           |
| node-50 | 39.07 Iris-setosa, 30.80 Iris-versicolor, 28.78 Iris-virginica | 8.264           |
+---------+----------------------------------------------------------------+-----------------+

sa: table[input,h2,discrimination2] split |node-91 node-92 node-93 node-94 node-95 node-96 node-97 node-98 node-99 node-100>
+----------+----------------------------------------------------------------+-----------------+
| input    | h2                                                             | discrimination2 |
+----------+----------------------------------------------------------------+-----------------+
| node-91  | 38.71 Iris-versicolor, 38.51 Iris-virginica, 30.31 Iris-setosa | 0.199           |
| node-92  | 39.00 Iris-versicolor, 38.19 Iris-virginica, 30.76 Iris-setosa | 0.813           |
| node-93  | 39.07 Iris-versicolor, 37.60 Iris-virginica, 31.36 Iris-setosa | 1.468           |
| node-94  | 38.91 Iris-versicolor, 37.14 Iris-virginica, 31.83 Iris-setosa | 1.769           |
| node-95  | 39.03 Iris-versicolor, 38.24 Iris-virginica, 30.72 Iris-setosa | 0.789           |
| node-96  | 38.80 Iris-versicolor, 37.62 Iris-virginica, 31.34 Iris-setosa | 1.181           |
| node-97  | 38.96 Iris-versicolor, 37.90 Iris-virginica, 31.06 Iris-setosa | 1.064           |
| node-98  | 39.07 Iris-versicolor, 37.55 Iris-virginica, 31.42 Iris-setosa | 1.525           |
| node-99  | 38.14 Iris-versicolor, 36.32 Iris-virginica, 32.64 Iris-setosa | 1.821           |
| node-100 | 39.04 Iris-versicolor, 37.84 Iris-virginica, 31.12 Iris-setosa | 1.194           |
+----------+----------------------------------------------------------------+-----------------+

sa: table[input,h2,discrimination2] split |node-141 node-142 node-143 node-144 node-145 node-146 node-147 node-148 node-149 node-150>
+----------+----------------------------------------------------------------+-----------------+
| input    | h2                                                             | discrimination2 |
+----------+----------------------------------------------------------------+-----------------+
| node-141 | 38.82 Iris-virginica, 37.60 Iris-versicolor, 28.68 Iris-setosa | 1.217           |
| node-142 | 38.47 Iris-virginica, 38.17 Iris-versicolor, 29.65 Iris-setosa | 0.299           |
| node-143 | 39.02 Iris-virginica, 37.54 Iris-versicolor, 28.59 Iris-setosa | 1.478           |
| node-144 | 38.97 Iris-virginica, 37.57 Iris-versicolor, 28.64 Iris-setosa | 1.402           |
| node-145 | 38.62 Iris-virginica, 37.50 Iris-versicolor, 28.64 Iris-setosa | 1.12            |
| node-146 | 38.68 Iris-virginica, 38.00 Iris-versicolor, 29.22 Iris-setosa | 0.679           |
| node-147 | 38.85 Iris-virginica, 37.95 Iris-versicolor, 29.08 Iris-setosa | 0.905           |
| node-148 | 38.97 Iris-virginica, 38.25 Iris-versicolor, 29.41 Iris-setosa | 0.727           |
| node-149 | 38.30 Iris-virginica, 37.50 Iris-versicolor, 28.86 Iris-setosa | 0.797           |
| node-150 | 38.89 Iris-virginica, 37.99 Iris-versicolor, 29.19 Iris-setosa | 0.9             |
+----------+----------------------------------------------------------------+-----------------+
And there we have it! 100% success rate. I would like the discrimination to be higher, but otherwise it is good. I guess a bigger example is in my near future. The good news is we can largely reuse this code for the next example.

Also, we may have luck using an h with a drop-below in it. eg:
h |*> #=> coeff-sort M drop-below[t] similar[input-pattern,pattern] |_self>
The question is how to find the best t. I guess discrimination is part of that answer. And there are other alternatives, using sigmoids and so on. For now I don't super care.
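Still, if we did want to automate the choice of t, here is a rough sketch (plain Python, not console code) of one way: sweep candidate thresholds over a labelled validation set and keep the t with the best accuracy, using average discrimination as a tie-breaker. Both pick_threshold and the h(Z, t) it calls are hypothetical; h is assumed to return a coeff-sorted list of (class, coeff) pairs, like the rows in the h2 tables above:
def pick_threshold(h, validation, candidates=(0.5, 0.6, 0.7, 0.8, 0.9)):
  # validation is a list of (input superposition, correct class) pairs
  best = None
  for t in candidates:
    correct, discrims = 0, []
    for Z, truth in validation:
      result = h(Z, t)                       # coeff-sorted list of (class, coeff)
      if not result:
        continue                             # everything dropped below t -- a miss
      if result[0][0] == truth:
        correct += 1
      second = result[1][1] if len(result) > 1 else 0
      discrims.append(result[0][1] - second)
    score = (correct, sum(discrims) / len(discrims) if discrims else 0)
    if best is None or score > best[0]:
      best = (score, t)
  return best[1]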

supervised learning of iris classes

Last post I gave the general algo for supervised pattern recognition. Kind of a big claim, so let's try an example bigger than truth tables. I decided iris types would be a good example. This data set consists of 150 examples, 50 for each type of iris. So I decided to use 40 of each type as the training set, and 10 of each type as the test cases.

Now, we need some python to convert the data into sw form.
Python:
import sys

from the_semantic_db_code import *
from the_semantic_db_functions import *
from the_semantic_db_processor import *

C = context_list("iris pattern recognition")

data_file = "data/iris-data/bezdekIris.data"

def learn_data(C,filename):
  k = 0
  with open(filename,'r') as f:
    for line in f:
      try:
        sepal_len,sepal_width,petal_len,petal_width,iris_class = line.strip().split(',')
        k += 1
        node = ket("node-" + str(k))
        r = ket("sepal-length: " +  sepal_len) + ket("sepal-width: " + sepal_width) + ket("petal-length: " + petal_len) + ket("petal-width: " + petal_width)
        if ((k - 1) % 50) < 40:              # learn training data set:
          C.learn("pattern",node,r)
          C.learn("M",node,iris_class)
        else:
          C.learn("input-pattern",node,r)    # learn test cases:
      except:                            # skip blank or malformed lines
        continue

learn_data(C,data_file)

# save the result:
sw_file = "sw-examples/iris-pattern-recognition.sw"
save_sw(C,sw_file)
Now for the results:
sa: load iris-pattern-recognition.sw
sa: h2 |*> #=> coeff-sort M similar[input-pattern,pattern] |_self>
sa: discrimination |*> #=> push-float discrim h2 |_self>

-- NB: use split to save typing
-- the Iris-setosa test cases:
sa: table[input,h2,discrimination] split |node-41 node-42 node-43 node-44 node-45 node-46 node-47 node-48 node-49 node-50> 
+---------+-------------------------------------------------------------+----------------+
| input   | h2                                                          | discrimination |
+---------+-------------------------------------------------------------+----------------+
| node-41 | 4.25 Iris-setosa, 0.25 Iris-versicolor                      | 4              |
| node-42 | 2 Iris-setosa, 0.50 Iris-versicolor                         | 1.5            |
| node-43 | 8.25 Iris-setosa, Iris-virginica, 0.75 Iris-versicolor      | 7.25           |
| node-44 | 3.50 Iris-setosa, 0.25 Iris-versicolor                      | 3.25           |
| node-45 | 3.75 Iris-setosa, 0.50 Iris-virginica                       | 3.25           |
| node-46 | 5.75 Iris-setosa, 2.25 Iris-virginica, 1.50 Iris-versicolor | 3.5            |
| node-47 | 9.25 Iris-setosa, 0.50 Iris-virginica                       | 8.75           |
| node-48 | 10 Iris-setosa, Iris-virginica, 0.75 Iris-versicolor        | 9              |
| node-49 | 9.50 Iris-setosa                                            | 9.5            |
| node-50 | 10 Iris-setosa, 0.50 Iris-versicolor, 0.50 Iris-virginica   | 9.5            |
+---------+-------------------------------------------------------------+----------------+

-- the Iris-versicolor test cases:
sa: table[input,h2,discrimination] split |node-91 node-92 node-93 node-94 node-95 node-96 node-97 node-98 node-99 node-100>
+----------+-------------------------------------------------------------+----------------+
| input    | h2                                                          | discrimination |
+----------+-------------------------------------------------------------+----------------+
| node-91  | 2.50 Iris-versicolor, 0.50 Iris-setosa, 0.50 Iris-virginica | 2              |
| node-92  | 4.25 Iris-versicolor, 3 Iris-virginica, 1.25 Iris-setosa    | 1.25           |
| node-93  | 2.25 Iris-versicolor, Iris-virginica, 0.25 Iris-setosa      | 1.25           |
| node-94  | 2.50 Iris-versicolor, 1.25 Iris-setosa                      | 1.25           |
| node-95  | 4.50 Iris-versicolor, Iris-virginica                        | 3.5            |
| node-96  | 2.75 Iris-versicolor, 2.50 Iris-virginica, 1.75 Iris-setosa | 0.25           |
| node-97  | 4.25 Iris-versicolor, 0.75 Iris-setosa, 0.75 Iris-virginica | 3.5            |
| node-98  | 4 Iris-versicolor, 0.75 Iris-virginica, 0.25 Iris-setosa    | 3.25           |
| node-99  | 1.50 Iris-setosa, 1.25 Iris-versicolor, 0.75 Iris-virginica | 0.25           |
| node-100 | 4.50 Iris-versicolor, 2.25 Iris-virginica, 0.50 Iris-setosa | 2.25           |
+----------+-------------------------------------------------------------+----------------+

-- the Iris-virginica test cases:
sa: table[input,h2,discrimination] split |node-141 node-142 node-143 node-144 node-145 node-146 node-147 node-148 node-149 node-150>
+----------+-------------------------------------------------------------+----------------+
| input    | h2                                                          | discrimination |
+----------+-------------------------------------------------------------+----------------+
| node-141 | 2.75 Iris-virginica, 1.50 Iris-versicolor, Iris-setosa      | 1.25           |
| node-142 | 3 Iris-virginica, 1.25 Iris-versicolor, Iris-setosa         | 1.75           |
| node-143 | 3 Iris-virginica, 1.75 Iris-versicolor, 0.25 Iris-setosa    | 1.25           |
| node-144 | 2.50 Iris-virginica, Iris-versicolor, 0.75 Iris-setosa      | 1.5            |
| node-145 | 2 Iris-virginica, Iris-versicolor, 0.25 Iris-setosa         | 1              |
| node-146 | 3.75 Iris-virginica, 2.25 Iris-versicolor, 1.25 Iris-setosa | 1.5            |
| node-147 | 3.25 Iris-virginica, 1.75 Iris-versicolor                   | 1.5            |
| node-148 | 4.25 Iris-virginica, 1.75 Iris-versicolor, 1.25 Iris-setosa | 2.5            |
| node-149 | 2.25 Iris-setosa, 1.75 Iris-virginica, 0.50 Iris-versicolor | 0.5            |
| node-150 | 5.75 Iris-virginica, 2.50 Iris-versicolor, 1.25 Iris-setosa | 3.25           |
+----------+-------------------------------------------------------------+----------------+
Notes:
1) we have two wrong answers, node-99 and node-149. Though they do have small discrimination of 0.25 and 0.5 respectively, so the code knows its results in those cases might be in error. Otherwise, the discrimination is really good! And, 100*28/30 = 93.3% success rate.
2) I tried the h operator:
h |*> #=> M drop-below[0.7] similar[input-pattern,pattern] |_self>
but it gave terrible results!
3) We can interpret the h2 operator:
h2 |*> #=> coeff-sort M similar[input-pattern,pattern] |_self>
as a weighted average of M applied to the results from similar.
eg, observe the results from similar applied to node-150 (a quick check after the table rebuilds the node-150 row from these values):
sa: table[node,coeff] 100 similar[input-pattern,pattern] |node-150>
+----------+-------+
| node     | coeff |
+----------+-------+
| node-62  | 50    |
| node-71  | 50    |
| node-117 | 50    |
| node-128 | 50    |
| node-139 | 50    |
| node-2   | 25    |
| node-13  | 25    |
| node-14  | 25    |
| node-26  | 25    |
| node-39  | 25    |
| node-67  | 25    |
| node-76  | 25    |
| node-78  | 25    |
| node-84  | 25    |
| node-85  | 25    |
| node-89  | 25    |
| node-102 | 25    |
| node-103 | 25    |
| node-104 | 25    |
| node-105 | 25    |
| node-106 | 25    |
| node-108 | 25    |
| node-109 | 25    |
| node-111 | 25    |
| node-113 | 25    |
| node-115 | 25    |
| node-124 | 25    |
| node-126 | 25    |
| node-127 | 25    |
| node-130 | 25    |
| node-134 | 25    |
| node-136 | 25    |
| node-138 | 25    |
+----------+-------+
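To make the weighted-average reading concrete, here is a quick stand-alone check (plain Python, not console code) that rebuilds the node-150 row of the h2 table from the similarity table just above. It assumes the node numbering follows the data file (nodes 1 to 50 are Iris-setosa, 51 to 100 Iris-versicolor, 101 to 150 Iris-virginica), and that the similarities are the table values divided by 100 (since foo includes a factor of 100):
# the nodes from the table, grouped by their similarity to node-150:
fifty = [62, 71, 117, 128, 139]
twenty_five = [2, 13, 14, 26, 39, 67, 76, 78, 84, 85, 89,
               102, 103, 104, 105, 106, 108, 109, 111, 113, 115,
               124, 126, 127, 130, 134, 136, 138]

def M(n):
  # map a training node to its iris class
  return "Iris-setosa" if n <= 50 else "Iris-versicolor" if n <= 100 else "Iris-virginica"

totals = {}
for node in fifty:
  totals[M(node)] = totals.get(M(node), 0) + 0.50
for node in twenty_five:
  totals[M(node)] = totals.get(M(node), 0) + 0.25

print(sorted(totals.items(), key=lambda kv: -kv[1]))
# [('Iris-virginica', 5.75), ('Iris-versicolor', 2.5), ('Iris-setosa', 1.25)]
# which is exactly the node-150 row: 5.75 Iris-virginica, 2.50 Iris-versicolor, 1.25 Iris-setosa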
That's it! It works! I guess next I should try an even larger example.

the general supervised pattern recognition algo

Last post I demonstrated toy supervised pattern recognition using logic states/gates. This time let's semi-formally define our algo:

Given the training data set D:
D = {(X1,Y1),(X2,Y2),...(Xn,Yn)}
where Xi and Yi are superpositions (and must not be empty superpositions, ie ones with all coeffs equal to 0)

Then learn these rules:
pattern |node: 1> => X1
pattern |node: 2> => X2
...
pattern |node: n> => Xn

M |node: 1> => Y1
M |node: 2> => Y2
...
M |node: n> => Yn

Then given the unlabeled data set U = {Z1,Z2,...Zm}, where Zi are superpositions of the same type as Xi, learn these rules:
input-pattern |example: 1> =>  Z1
input-pattern |example: 2> =>  Z2
...
input-pattern |example: m> =>  Zm

Then here is one proposed h: X -> Y:
h |*> #=> M drop-below[0.7] similar[input-pattern,pattern] |_self>

Here is another proposed h: X -> Y:
h2 |*> #=> coeff-sort M similar[input-pattern,pattern] |_self>
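For concreteness, here is a minimal stand-alone sketch of the same recipe in plain Python (not the semantic DB implementation), using dicts as superpositions, a toy rescale-and-min simm as a stand-in for similar[op], and the XOR training data from the last post:
def simm(f, g):
  # toy similarity: rescale both to sum 1, then sum the min of matching coeffs
  fn = {k: v / sum(f.values()) for k, v in f.items()}
  gn = {k: v / sum(g.values()) for k, v in g.items()}
  return sum(min(fn.get(k, 0), gn.get(k, 0)) for k in set(fn) | set(gn))

# the training data D, written as pattern/M rules (XOR from the last post):
pattern = {"a": {"x0": 1, "x1": 1},
           "b": {"x0": 1, "x2": 1},
           "c": {"x0": 1},
           "d": {"x0": 1, "x1": 1, "x2": 1}}
M = {"a": "1", "b": "1", "c": "0", "d": "0"}

def h(Z, t=0.7):
  # M drop-below[t] similar[input-pattern,pattern], then coeff-sort as in h2
  result = {}
  for node, X in pattern.items():
    s = simm(Z, X)
    if s >= t:
      result[M[node]] = result.get(M[node], 0) + s
  return sorted(result.items(), key=lambda kv: -kv[1])

print(h({"x0": 1, "x1": 1}))                  # [('1', 1.0)] -- exact match with pattern a
print(h({"x0": 0.7, "x1": 0.9, "x2": 0.85}))  # [('0', 0.95...)] -- the fuzzy z example from the last post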

A worked example coming up next!

Update: I suspect for more complex data sets we will need something like this:
h |*> #=> coeff-sort M some-complex-function similar[input-pattern,pattern] |_self>
for some as yet unknown "some-complex-function" operator.

Update: I suspect we can use this scheme to implement large numbers of if/then rules. eg perhaps:
"if Xi then Yi" using:
if-pattern |node: i> => Xi
M |node: i> => Yi
h |*> #=> M drop-below[t] similar[input-pattern,if-pattern] |_self>

Update: here is a proof of concept:
sa: dump
----------------------------------------
|context> => |context: sw console>

if-pattern |node-1> => |Fred is whistling>
M |node-1> => |Fred is happy>

if-pattern |node-2> => |Sam is whistling>
M |node-2> => |Sam is happy>

input-pattern |x> => |Fred is whistling>
input-pattern |y> => |Sam is whistling>

h |*> #=> M similar[input-pattern,if-pattern] |_self>
----------------------------------------

sa: h |x>
|Fred is happy>

sa: h |y>
|Sam is happy>
Looks good. And of course, this means we can easily load up large numbers of rules like this, and pass them around the internet as sw files.

Update: let's tidy up our proof of concept example using similar-input[op] and if-then machines:
----------------------------------------
|context> => |context: simple if-then machine>

pattern |node: 1> => |Fred is whistling>
then |node: 1> => |Fred is happy>

pattern |node: 2> => |Sam is whistling>
then |node: 2> => |Sam is busy>

implies? |*> #=> then similar-input[pattern] |_self>
----------------------------------------
Now put it to use:
sa: implies? |Fred is whistling>
|Fred is happy>

sa: implies? |Sam is whistling>
|Sam is busy>
OK, a very simple example, but I hope it shows something interesting. With a little imagination it could be used for all sorts of things. Note that in this example I used single kets as the patterns and consequences, but with not much work we could do the same using superpositions. Go see the if-then machines examples for that. The next question is how do we auto-learn these kinds of rules? We don't want to go the cyc route and hand-define everything! Certainly we would expect a full AGI to be able to auto-learn rules. But can we have auto-learning and yet not have a full AGI?

Wednesday 11 March 2015

supervised pattern recognition

So, according to wikipedia we have this definition:


Formally, the problem of supervised pattern recognition can be stated as follows: Given an unknown function g: X -> Y (the ground truth) that maps input instances x in X to output labels y in Y, along with training data D = {(x1,y1), ..., (xn,yn)} assumed to represent accurate examples of the mapping, produce a function h: X -> Y that approximates as closely as possible the correct mapping g.

Now, I have been making claims about pattern recognition using my similar[op]. Let me show that we can actually do this with the BKO scheme:

Here is the training data for AND (and we will use the convention that x0 is always 1):
(1,0,0),0
(1,0,1),0
(1,1,0),0
(1,1,1),1
-- now, let's encode that in BKO:
-- NB: if a ket has 0 for coeff we don't include it in our superpositions.
pattern |a> => |x0> 
pattern |b> => |x0> + |x2>
pattern |c> => |x0> + |x1>
pattern |d> => |x0> + |x1> + |x2>
M |a> => |0>
M |b> => |0>
M |c> => |0>
M |d> => |1>

-- now define some example input patterns:
input-pattern |u> => |x0> 
input-pattern |v> => |x0> + |x2>
input-pattern |x> => |x0> + |x1>
input-pattern |y> => |x0> + |x1> + |x2>

-- now define our h:
h |*> #=> M drop-below[0.7] similar[input-pattern,pattern] |_self>

-- now take a look at the table:
sa: table[input,h] (|u> + |v> + |x> + |y>)
+-------+---+
| input | h |
+-------+---+
| u     | 0 |
| v     | 0 |
| x     | 0 |
| y     | 1 |
+-------+---+
It worked!

Here is the training data for NAND from here:
(1,0,0),1
(1,0,1),1
(1,1,0),1
(1,1,1),0
-- now, let's encode that in BKO:
pattern |a> => |x0>
pattern |b> => |x0> + |x2>
pattern |c> => |x0> + |x1>
pattern |d> => |x0> + |x1> + |x2>
M |a> => |1>
M |b> => |1>
M |c> => |1>
M |d> => |0>

-- now define some example input patterns:
input-pattern |u> => |x0>
input-pattern |v> => |x0> + |x2>
input-pattern |x> => |x0> + |x1>
input-pattern |y> => |x0> + |x1> + |x2>

-- now take a look at the table:
-- NB: split is just to save typing
sa: table[input,h] split |u v x y>
+-------+---+
| input | h |
+-------+---+
| u     | 1 |
| v     | 1 |
| x     | 1 |
| y     | 0 |
+-------+---+
Again, it worked!

And finally, the cool one! XOR is meant to be hard (it is not linearly separable, so a single-layer perceptron can't learn it), but we can do it in our scheme.
Here is the training data for XOR:
(1,1,0),1
(1,0,1),1
(1,0,0),0
(1,1,1),0
-- now let's encode that in BKO:
pattern |a> => |x0> + |x1>
pattern |b> => |x0> + |x2>
pattern |c> => |x0>
pattern |d> => |x0> + |x1> + |x2>
M |a> => |1>
M |b> => |1>
M |c> => |0>
M |d> => |0>

-- now define some example input patterns:
input-pattern |u> => |x0> + |x1>
input-pattern |v> => |x0> + |x2>
input-pattern |x> => |x0>
input-pattern |y> => |x0> + |x1> + |x2>

-- now take a look at the table:
sa: table[input,h] split |u v x y>
+-------+---+
| input | h |
+-------+---+
| u     | 1 |
| v     | 1 |
| x     | 0 |
| y     | 0 |
+-------+---+
Again, it worked!!
Now for a sample test case where the pattern is not an exact match:
sa: input-pattern |z> => 0.7|x0> + 0.900|x1> + 0.85|x2>
sa: table[input,h] split |u v x y z>
+-------+--------+
| input | h      |
+-------+--------+
| u     | 1      |
| v     | 1      |
| x     | 0      |
| y     | 0      |
| z     | 0.95 0 |
+-------+--------+
ie, z is a 95% match with |0>

Now, look at the underlying similarity values:
sa: foo |*> #=> 100 similar[input-pattern,pattern] |_self>
sa: table[input,foo] split |u v x y z>
+-------+------------------------------------+
| input | foo                                |
+-------+------------------------------------+
| u     | 100 a, 66.67 d, 50 b, 50 c         |
| v     | 100 b, 66.67 d, 50 a, 50 c         |
| x     | 100 c, 50 a, 50 b, 33.33 d         |
| y     | 100 d, 66.67 a, 66.67 b, 33.33 c   |
| z     | 95.24 d, 65.31 a, 63.27 b, 28.57 c |
+-------+------------------------------------+
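As a side note, the numbers in this table are consistent with a similarity metric that rescales both superpositions to sum to 1 and then sums the minimum of the matching coefficients (the same toy form used in the sketches further up the page). This is only a guess at the behaviour of similar[op], not its source code, but a quick stand-alone Python check reproduces the z row exactly:
def simm(f, g):
  # rescale both superpositions to sum 1, then sum the min of matching coefficients
  fn = {k: v / sum(f.values()) for k, v in f.items()}
  gn = {k: v / sum(g.values()) for k, v in g.items()}
  return sum(min(fn.get(k, 0), gn.get(k, 0)) for k in set(fn) | set(gn))

z = {"x0": 0.7, "x1": 0.9, "x2": 0.85}
patterns = {"a": {"x0": 1, "x1": 1},
            "b": {"x0": 1, "x2": 1},
            "c": {"x0": 1},
            "d": {"x0": 1, "x1": 1, "x2": 1}}

for name, p in patterns.items():
  print(name, round(100 * simm(z, p), 2))
# a 65.31
# b 63.27
# c 28.57
# d 95.24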
And some notes:
1) it should be obvious that if it works for one logic case it will work for all cases. Why? Well, the detection of the logic state (ie, the similar[op] bit) is independent of the apply label stage (ie, the M |a> => ...).
2) we have only 1 parameter to tweak: the value we use in drop-below[] that decides how similar the incoming pattern has to be to a stored pattern before it counts; anything below that threshold is ignored. For now drop-below[0.7] (ie 70% similarity) should probably do. Though we could also swap in a sigmoid if we want.
3) I guess you could call this version "inclusive matching". A node does not "care" if its neighbour nodes are firing. In "exclusive matching" a node will try to suppress the firing of neighbouring nodes.
4) the method is perfectly general. If we cast our training data to superpositions, and use this general training data:
D = {(X1,Y1),(X2,Y2),...(Xn,Yn)}
where Xi, and Yi are superpositions.
Then simply enough:
pattern |node: 1> => X1
pattern |node: 2> => X2
...
pattern |node: n> => Xn

M |node: 1> => Y1
M |node: 2> => Y2
...
M |node: n> => Yn

-- define our h:
h |*> #=> M drop-below[0.7] similar[input-pattern,pattern] |_self>

And that should be it! The only thing is we need to take care if Xi or Yi are superpositions with all coeffs 0, in which case we would need to do some massaging. Besides, who inputs "nothing" and expects to get out "something"?

Seriously cool! And BTW, similar[op] is biologically plausible in terms of neurons. I'll show how in phase 4.

Update: let's redo this, and tidy up the logic example a little.
Define our patterns we want to match:
(1,0,0)
(1,0,1)
(1,1,0)
(1,1,1)
Then in BKO:
pattern |a> => |x0> + 0|x1> + 0|x2>
pattern |b> => |x0> + 0|x1> + |x2>
pattern |c> => |x0> + |x1> + 0|x2>
pattern |d> => |x0> + |x1> + |x2>
Now define our labels:
OR-label |a> => |0>
OR-label |b> => |1>
OR-label |c> => |1>
OR-label |d> => |1>

XOR-label |a> => |0>
XOR-label |b> => |1>
XOR-label |c> => |1>
XOR-label |d> => |0>

AND-label |a> => |0>
AND-label |b> => |0>
AND-label |c> => |0>
AND-label |d> => |1>
Now define some example input patterns:
input-pattern |u> => |x0> + 0|x1> + 0|x2>
input-pattern |v> => |x0> + 0|x1> + |x2>
input-pattern |x> => |x0> + |x1> + 0|x2>
input-pattern |y> => |x0> + |x1> + |x2>
Now define our logic operators:
OR |*> #=> OR-label drop-below[0.7] similar[input-pattern,pattern] |_self>
XOR |*> #=> XOR-label drop-below[0.7] similar[input-pattern,pattern] |_self>
AND |*> #=> AND-label drop-below[0.7] similar[input-pattern,pattern] |_self>
Now take a look at the resulting table:
sa: table[input,OR,XOR,AND] split |u v x y>
+-------+----+-----+-----+
| input | OR | XOR | AND |
+-------+----+-----+-----+
| u     | 0  | 0   | 0   |
| v     | 1  | 1   | 0   |
| x     | 1  | 1   | 0   |
| y     | 1  | 0   | 1   |
+-------+----+-----+-----+
So we are matching our input patterns (u, v, x and y) against our stored patterns (a, b, c and d), and then the logic operators (OR, XOR, AND) choose what logic labels (OR-label, XOR-label, AND-label) to apply to those patterns.

The resulting sw file is here.

Update: Alternatively, we can use True/False. We could define a whole new set of OR-label, XOR-label and AND-label, but it is shorter to just do this:
TF |0> => |False>
TF |1> => |True>

TF-OR |*> #=> TF OR |_self>
TF-XOR |*> #=> TF XOR |_self>
TF-AND |*> #=> TF AND |_self>
Now take a look at the resulting table:
sa: table[input,TF-OR,TF-XOR,TF-AND] split |u v x y>
+-------+-------+--------+--------+
| input | TF-OR | TF-XOR | TF-AND |
+-------+-------+--------+--------+
| u     | False | False  | False  |
| v     | True  | True   | False  |
| x     | True  | True   | False  |
| y     | True  | False  | True   |
+-------+-------+--------+--------+
Cool! And it didn't take much work. In the BKO scheme, renaming kets is quite often very simple and easy to do.

Monday 9 March 2015

the second BKO claim:

Finally, what I have been building to with my similar[op] function operator:

"we can make a lot of progress in pattern recognition if we can find mappings from objects to well-behaved, deterministic, distinctive superpositions"

where:
1) well-behaved means similar objects return similar superpositions
2) deterministic means if you feed in the same object, you get essentially the same superposition (though there is a little leeway, in that it doesn't have to be 100% identical on each run, but close)
3) distinctive means different object types have easily distinguishable superpositions.

And we note that these are all well satisfied by our webpage example. And presumably with not much work we can achieve the same with a lot of types of text document. The real-world case of images and sounds will of course be much harder! The good news is that we only need to solve it once. Once we have good mappings from images/sounds to superpositions, I suspect higher-order pattern recognition (ie, more abstract concepts) should be relatively easy.

More uses of the similarity metric coming up!

Update: the exact details of the superposition are irrelevant. They can be anything! They only need to satisfy (1), (2) and (3), and then we are done.

Update: we now have a few of these. For strings letter-ngrams[1,2,3] works pretty well. For DNA sequences, maybe we need longer ngrams, like letter-ngrams[1,2,3,4,5,6] or something close to that. For longer text documents perhaps word-ngrams[1,2,3]. For other documents maybe fragment similarity. I haven't yet written the code, but for images, maybe a starting point (though it needs a lot of further processing) is what I call image-ngrams. Basically decompose an image into say 5*5 squares. Then process each of those squares separately, and so on. I don't have the full details of that yet. Point is, we are making progress on finding object -> superposition mappings.
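For example, here is a rough sketch in plain Python of what a letter-ngrams[1,2,3] style mapping from a string to a superposition might look like. The letter_ngrams function here is just an illustration, not the project's actual code, and the real operator may differ in details such as how it handles spaces and coefficients:
def letter_ngrams(s, sizes=(1, 2, 3)):
  # map a string to a dict of ngram -> count, standing in for a superposition
  sp = {}
  for n in sizes:
    for i in range(len(s) - n + 1):
      gram = s[i:i + n]
      sp[gram] = sp.get(gram, 0) + 1
  return sp

print(letter_ngrams("frog"))
# {'f': 1, 'r': 1, 'o': 1, 'g': 1, 'fr': 1, 'ro': 1, 'og': 1, 'fro': 1, 'rog': 1}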

Update: those mappings above are OK I suppose. But they are mapping objects to similar strings. It would be much preferable to have mappings to superpositions that encode meaning of that object. And that objects with similar meaning map to similar superpositions. For example, what cortical.io is doing mapping words to SDRs (sparse distributed representations). Indeed, I suspect this would go some way towards language translation. Instead of translating by looking up a dictionary, put in a word in one language then find the word in the second language that has the most similar meaning superposition. Presuming the meaning superpositions don't depend too much on the language they come from. I think this is a fairly safe assumption.

Friday 6 March 2015

the main event: pattern recognition of websites

We are finally there! We deliberately left |website 11> out of our average website hashes. Now, given those day-11 pages as input, do they classify correctly?

Here is the BKO:
-- define the list of average websites:
|ave list> => |average abc> + |average adelaidenow> + |average slashdot> + |average smh> + |average wikipedia> + |average youtube>

-- we want average hash to be distinct from the other hashes:
|null> => map[hash-4B,average-hash-4B] "" |ave list>

-- now, let's see how well these patterns recognize the pages we left out of our average:
result |abc 11> => 100 similar[hash-4B,average-hash-4B] |abc 11>
result |adelaidenow 11> => 100 similar[hash-4B,average-hash-4B] |adelaidenow 11>
result |slashdot 11> => 100 similar[hash-4B,average-hash-4B] |slashdot 11>
result |smh 11> => 100 similar[hash-4B,average-hash-4B] |smh 11>
result |wikipedia 11> => 100 similar[hash-4B,average-hash-4B] |wikipedia 11>
result |youtube 11> => 100 similar[hash-4B,average-hash-4B] |youtube 11>

-- tidy results:
tidy-result |abc 11> => drop-below[40] result |_self>
tidy-result |adelaidenow 11> => drop-below[40] result |_self>
tidy-result |slashdot 11> => drop-below[40] result |_self>
tidy-result |smh 11> => drop-below[40] result |_self>
tidy-result |wikipedia 11> => drop-below[40] result |_self>
tidy-result |youtube 11> => drop-below[40] result |_self>
And now, drum-roll, the results!
sa: load improved-fragment-webpages.sw
sa: load create-average-website-fragments.sw
sa: load create-website-pattern-recognition-matrix.sw

sa: matrix[result]
[ average abc         ] = [  91.70  28.73  25.76  37.77  29.45  24.33  ] [ abc 11         ]
[ average adelaidenow ]   [  28.77  78.11  26.71  29.85  25.25  28.18  ] [ adelaidenow 11 ]
[ average slashdot    ]   [  25.76  26.88  79.05  28.27  26.86  23.20  ] [ slashdot 11    ]
[ average smh         ]   [  37.80  29.75  28.16  85.55  32.06  24.95  ] [ smh 11         ]
[ average wikipedia   ]   [  29.71  25.25  26.91  31.86  85.19  22.09  ] [ wikipedia 11   ]
[ average youtube     ]   [  24.32  28.18  23.47  24.92  21.94  82.12  ] [ youtube 11     ]

sa: matrix[tidy-result]
[ average abc         ] = [  91.70  0      0      0      0      0      ] [ abc 11         ]
[ average adelaidenow ]   [  0      78.11  0      0      0      0      ] [ adelaidenow 11 ]
[ average slashdot    ]   [  0      0      79.05  0      0      0      ] [ slashdot 11    ]
[ average smh         ]   [  0      0      0      85.55  0      0      ] [ smh 11         ]
[ average wikipedia   ]   [  0      0      0      0      85.19  0      ] [ wikipedia 11   ]
[ average youtube     ]   [  0      0      0      0      0      82.12  ] [ youtube 11     ]
Finally, let's look at the discrimination. ie the difference between the highest matching result and the second highest:
sa: discrimination |*> #=> discrim result |_self>
sa: table[page,discrimination] rel-kets[result] |>
+----------------+----------------+
| page           | discrimination |
+----------------+----------------+
| abc 11         | 53.90          |
| adelaidenow 11 | 48.36          |
| slashdot 11    | 50.89          |
| smh 11         | 47.78          |
| wikipedia 11   | 53.14          |
| youtube 11     | 53.94          |
+----------------+----------------+
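As a quick sanity check (plain Python, not console code), each discrimination number is just the top entry of the corresponding result-matrix column minus the second-highest entry. Here is the abc 11 column worked through:
# the abc 11 column of the result matrix above:
abc_11_column = [91.70, 28.77, 25.76, 37.80, 29.71, 24.32]

def discrim(values):
  # highest matching result minus the second highest
  top, second = sorted(values, reverse=True)[:2]
  return top - second

print(round(discrim(abc_11_column), 2))   # 53.9, matching the abc 11 entry in the table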
There we have it. Discrimination on the order of 50%! That is good.

Heaps more to come!

website similarity matrices

This time, let's take a look at how similar each website is to itself over the 11 days.

Here is the BKO:
-- create website similarity matrices:
-- list of abc websites, note we include |abc 11> and |average abc>
|full abc list> => |abc 1> + |abc 2> + |abc 3> + |abc 4> + |abc 5> + |abc 6> + |abc 7> + |abc 8> + |abc 9> + |abc 10> + |abc 11> + |average abc>

-- we want abc-hash to be distinct from standard hash, to reduce the matrix to abc only
abc-hash-4B |*> #=> hash-4B |_self>
|null> => map[abc-hash-4B] "" |full abc list>

-- we want the abc-simm to be distinct from the standard simm, to reduce the matrix to abc only
abc-simm |*> #=> 100 self-similar[abc-hash-4B] |_self>
|null> => map[abc-simm,abc-similarity] "" |full abc list>

-- now the rest of them:
|full adelaidenow list> => |adelaidenow 1> + |adelaidenow 2> + |adelaidenow 3> + |adelaidenow 4> + |adelaidenow 5> + |adelaidenow 6> + |adelaidenow 7> + |adelaidenow 8> + |adelaidenow 9> + |adelaidenow 10> + |adelaidenow 11> + |average adelaidenow>
adelaidenow-hash-4B |*> #=> hash-4B |_self>
|null> => map[adelaidenow-hash-4B] "" |full adelaidenow list>
adelaidenow-simm |*> #=> 100 self-similar[adelaidenow-hash-4B] |_self>
|null> => map[adelaidenow-simm,adelaidenow-similarity] "" |full adelaidenow list>

|full slashdot list> => |slashdot 1> + |slashdot 2> + |slashdot 3> + |slashdot 4> + |slashdot 5> + |slashdot 6> + |slashdot 7> + |slashdot 8> + |slashdot 9> + |slashdot 10> + |slashdot 11> + |average slashdot>
slashdot-hash-4B |*> #=> hash-4B |_self>
|null> => map[slashdot-hash-4B] "" |full slashdot list>
slashdot-simm |*> #=> 100 self-similar[slashdot-hash-4B] |_self>
|null> => map[slashdot-simm,slashdot-similarity] "" |full slashdot list>

|full smh list> => |smh 1> + |smh 2> + |smh 3> + |smh 4> + |smh 5> + |smh 6> + |smh 7> + |smh 8> + |smh 9> + |smh 10> + |smh 11> + |average smh>
smh-hash-4B |*> #=> hash-4B |_self>
|null> => map[smh-hash-4B] "" |full smh list>
smh-simm |*> #=> 100 self-similar[smh-hash-4B] |_self>
|null> => map[smh-simm,smh-similarity] "" |full smh list>

|full wikipedia list> => |wikipedia 1> + |wikipedia 2> + |wikipedia 3> + |wikipedia 4> + |wikipedia 5> + |wikipedia 6> + |wikipedia 7> + |wikipedia 8> + |wikipedia 9> + |wikipedia 10> + |wikipedia 11> + |average wikipedia>
wikipedia-hash-4B |*> #=> hash-4B |_self>
|null> => map[wikipedia-hash-4B] "" |full wikipedia list>
wikipedia-simm |*> #=> 100 self-similar[wikipedia-hash-4B] |_self>
|null> => map[wikipedia-simm,wikipedia-similarity] "" |full wikipedia list>

|full youtube list> => |youtube 1> + |youtube 2> + |youtube 3> + |youtube 4> + |youtube 5> + |youtube 6> + |youtube 7> + |youtube 8> + |youtube 9> + |youtube 10> + |youtube 11> + |average youtube>
youtube-hash-4B |*> #=> hash-4B |_self>
|null> => map[youtube-hash-4B] "" |full youtube list>
youtube-simm |*> #=> 100 self-similar[youtube-hash-4B] |_self>
|null> => map[youtube-simm,youtube-similarity] "" |full youtube list>
Here are the resulting matrices:
-- load the data:
sa: load improved-fragment-webpages.sw
sa: load create-average-website-fragments.sw
sa: load create-website-similarity-matrices.sw

-- show our resulting matrices:
sa: matrix[abc-similarity]
[ abc 1       ] = [  100.00  95.52   95.03   91.85   91.36   91.86   91.88   91.85   92.19   92.19   91.42   93.47   ] [ abc 1       ]
[ abc 2       ]   [  95.52   100.00  95.50   91.86   91.38   91.91   92.04   91.71   92.41   92.47   91.25   93.51   ] [ abc 2       ]
[ abc 3       ]   [  95.03   95.50   100.00  91.80   91.32   92.03   92.10   91.66   92.41   92.41   91.20   93.46   ] [ abc 3       ]
[ abc 4       ]   [  91.85   91.86   91.80   100.00  92.64   92.55   91.80   92.19   91.69   91.68   92.13   92.93   ] [ abc 4       ]
[ abc 5       ]   [  91.36   91.38   91.32   92.64   100.00  92.44   91.40   91.85   91.29   91.25   91.41   92.56   ] [ abc 5       ]
[ abc 6       ]   [  91.86   91.91   92.03   92.55   92.44   100.00  92.95   92.78   91.93   91.90   91.59   93.23   ] [ abc 6       ]
[ abc 7       ]   [  91.88   92.04   92.10   91.80   91.40   92.95   100.00  93.05   93.13   93.02   91.74   93.16   ] [ abc 7       ]
[ abc 8       ]   [  91.85   91.71   91.66   92.19   91.85   92.78   93.05   100.00  93.71   93.52   92.06   93.40   ] [ abc 8       ]
[ abc 9       ]   [  92.19   92.41   92.41   91.69   91.29   91.93   93.13   93.71   100.00  95.57   91.56   93.46   ] [ abc 9       ]
[ abc 10      ]   [  92.19   92.47   92.41   91.68   91.25   91.90   93.02   93.52   95.57   100.00  91.61   93.45   ] [ abc 10      ]
[ abc 11      ]   [  91.42   91.25   91.20   92.13   91.41   91.59   91.74   92.06   91.56   91.61   100.00  91.70   ] [ abc 11      ]
[ average abc ]   [  93.47   93.51   93.46   92.93   92.56   93.23   93.16   93.40   93.46   93.45   91.70   100.00  ] [ average abc ]

sa: matrix[adelaidenow-similarity]
[ adelaidenow 1       ] = [  100.00  86.13   83.22   78.11   77.27   76.23   75.86   75.61   75.79   76.08   76.13   80.64   ] [ adelaidenow 1       ]
[ adelaidenow 2       ]   [  86.13   100.00  87.38   81.46   77.52   77.60   77.07   76.90   76.26   77.26   76.53   82.36   ] [ adelaidenow 2       ]
[ adelaidenow 3       ]   [  83.22   87.38   100.00  83.60   78.24   77.29   76.68   76.56   76.45   77.41   76.22   82.18   ] [ adelaidenow 3       ]
[ adelaidenow 4       ]   [  78.11   81.46   83.60   100.00  83.39   78.50   77.28   77.24   75.75   76.77   76.67   81.82   ] [ adelaidenow 4       ]
[ adelaidenow 5       ]   [  77.27   77.52   78.24   83.39   100.00  81.76   77.29   76.84   76.31   76.39   77.43   81.12   ] [ adelaidenow 5       ]
[ adelaidenow 6       ]   [  76.23   77.60   77.29   78.50   81.76   100.00  79.82   78.85   76.34   77.09   76.98   81.08   ] [ adelaidenow 6       ]
[ adelaidenow 7       ]   [  75.86   77.07   76.68   77.28   77.29   79.82   100.00  84.92   78.38   77.13   76.90   81.08   ] [ adelaidenow 7       ]
[ adelaidenow 8       ]   [  75.61   76.90   76.56   77.24   76.84   78.85   84.92   100.00  82.06   79.31   77.56   81.59   ] [ adelaidenow 8       ]
[ adelaidenow 9       ]   [  75.79   76.26   76.45   75.75   76.31   76.34   78.38   82.06   100.00  85.68   78.15   80.79   ] [ adelaidenow 9       ]
[ adelaidenow 10      ]   [  76.08   77.26   77.41   76.77   76.39   77.09   77.13   79.31   85.68   100.00  82.97   80.86   ] [ adelaidenow 10      ]
[ adelaidenow 11      ]   [  76.13   76.53   76.22   76.67   77.43   76.98   76.90   77.56   78.15   82.97   100.00  78.11   ] [ adelaidenow 11      ]
[ average adelaidenow ]   [  80.64   82.36   82.18   81.82   81.12   81.08   81.08   81.59   80.79   80.86   78.11   100.00  ] [ average adelaidenow ]

sa: matrix[slashdot-similarity]
[ average slashdot ] = [  100.00  81.12   80.99   81.20   81.45   81.67   81.31   81.17   81.37   81.24   81.09   79.05   ] [ average slashdot ]
[ slashdot 1       ]   [  81.12   100.00  79.43   77.62   79.45   78.59   78.19   78.66   78.62   77.47   79.05   79.06   ] [ slashdot 1       ]
[ slashdot 2       ]   [  80.99   79.43   100.00  79.26   78.77   78.15   79.32   77.31   77.63   78.11   78.44   77.96   ] [ slashdot 2       ]
[ slashdot 3       ]   [  81.20   77.62   79.26   100.00  79.08   78.22   79.18   78.34   77.85   78.71   78.50   78.37   ] [ slashdot 3       ]
[ slashdot 4       ]   [  81.45   79.45   78.77   79.08   100.00  81.20   78.27   79.20   78.12   78.15   78.83   78.82   ] [ slashdot 4       ]
[ slashdot 5       ]   [  81.67   78.59   78.15   78.22   81.20   100.00  78.58   79.85   79.29   79.04   78.56   77.78   ] [ slashdot 5       ]
[ slashdot 6       ]   [  81.31   78.19   79.32   79.18   78.27   78.58   100.00  78.62   78.54   79.33   78.07   78.00   ] [ slashdot 6       ]
[ slashdot 7       ]   [  81.17   78.66   77.31   78.34   79.20   79.85   78.62   100.00  80.17   78.86   78.40   78.65   ] [ slashdot 7       ]
[ slashdot 8       ]   [  81.37   78.62   77.63   77.85   78.12   79.29   78.54   80.17   100.00  79.38   79.60   79.02   ] [ slashdot 8       ]
[ slashdot 9       ]   [  81.24   77.47   78.11   78.71   78.15   79.04   79.33   78.86   79.38   100.00  78.32   78.34   ] [ slashdot 9       ]
[ slashdot 10      ]   [  81.09   79.05   78.44   78.50   78.83   78.56   78.07   78.40   79.60   78.32   100.00  81.39   ] [ slashdot 10      ]
[ slashdot 11      ]   [  79.05   79.06   77.96   78.37   78.82   77.78   78.00   78.65   79.02   78.34   81.39   100.00  ] [ slashdot 11      ]

sa: matrix[smh-similarity]
[ average smh ] = [  100.00  87.76   88.16   87.95   87.62   86.82   87.31   87.12   87.72   87.54   87.61   85.55   ] [ average smh ]
[ smh 1       ]   [  87.76   100.00  89.44   87.69   86.23   85.33   85.30   84.97   85.34   84.70   85.04   84.75   ] [ smh 1       ]
[ smh 2       ]   [  88.16   89.44   100.00  89.80   86.25   85.27   85.58   85.36   85.54   85.61   85.38   85.40   ] [ smh 2       ]
[ smh 3       ]   [  87.95   87.69   89.80   100.00  86.81   85.04   85.31   85.40   85.21   85.19   85.47   84.93   ] [ smh 3       ]
[ smh 4       ]   [  87.62   86.23   86.25   86.81   100.00  86.63   86.12   85.06   85.64   85.15   85.34   85.00   ] [ smh 4       ]
[ smh 5       ]   [  86.82   85.33   85.27   85.04   86.63   100.00  85.24   84.36   85.20   84.55   85.00   84.99   ] [ smh 5       ]
[ smh 6       ]   [  87.31   85.30   85.58   85.31   86.12   85.24   100.00  86.19   86.59   85.03   85.26   85.81   ] [ smh 6       ]
[ smh 7       ]   [  87.12   84.97   85.36   85.40   85.06   84.36   86.19   100.00  86.36   85.48   85.38   85.48   ] [ smh 7       ]
[ smh 8       ]   [  87.72   85.34   85.54   85.21   85.64   85.20   86.59   86.36   100.00  87.76   86.94   85.89   ] [ smh 8       ]
[ smh 9       ]   [  87.54   84.70   85.61   85.19   85.15   84.55   85.03   85.48   87.76   100.00  90.65   85.16   ] [ smh 9       ]
[ smh 10      ]   [  87.61   85.04   85.38   85.47   85.34   85.00   85.26   85.38   86.94   90.65   100.00  85.75   ] [ smh 10      ]
[ smh 11      ]   [  85.55   84.75   85.40   84.93   85.00   84.99   85.81   85.48   85.89   85.16   85.75   100.00  ] [ smh 11      ]

sa: matrix[wikipedia-similarity]
[ average wikipedia ] = [  100.00  87.67   87.51   87.82   85.86   88.06   88.10   87.71   86.65   87.33   87.47   85.19   ] [ average wikipedia ]
[ wikipedia 1       ]   [  87.67   100.00  88.60   87.80   85.57   85.41   85.93   84.48   84.13   83.71   84.48   84.21   ] [ wikipedia 1       ]
[ wikipedia 2       ]   [  87.51   88.60   100.00  89.28   84.95   85.85   86.16   85.72   82.95   83.88   86.01   82.89   ] [ wikipedia 2       ]
[ wikipedia 3       ]   [  87.82   87.80   89.28   100.00  85.36   86.57   86.41   85.67   83.09   84.50   85.48   83.13   ] [ wikipedia 3       ]
[ wikipedia 4       ]   [  85.86   85.57   84.95   85.36   100.00  84.04   83.77   82.42   83.65   82.37   82.08   83.32   ] [ wikipedia 4       ]
[ wikipedia 5       ]   [  88.06   85.41   85.85   86.57   84.04   100.00  88.34   86.72   84.71   85.80   85.91   84.22   ] [ wikipedia 5       ]
[ wikipedia 6       ]   [  88.10   85.93   86.16   86.41   83.77   88.34   100.00  87.70   85.43   85.85   86.64   84.31   ] [ wikipedia 6       ]
[ wikipedia 7       ]   [  87.71   84.48   85.72   85.67   82.42   86.72   87.70   100.00  86.18   87.02   88.46   85.24   ] [ wikipedia 7       ]
[ wikipedia 8       ]   [  86.65   84.13   82.95   83.09   83.65   84.71   85.43   86.18   100.00  85.88   85.55   86.28   ] [ wikipedia 8       ]
[ wikipedia 9       ]   [  87.33   83.71   83.88   84.50   82.37   85.80   85.85   87.02   85.88   100.00  88.10   86.46   ] [ wikipedia 9       ]
[ wikipedia 10      ]   [  87.47   84.48   86.01   85.48   82.08   85.91   86.64   88.46   85.55   88.10   100.00  87.17   ] [ wikipedia 10      ]
[ wikipedia 11      ]   [  85.19   84.21   82.89   83.13   83.32   84.22   84.31   85.24   86.28   86.46   87.17   100.00  ] [ wikipedia 11      ]

sa: matrix[youtube-similarity]
[ average youtube ] = [  100.00  85.21   84.63   85.38   85.98   86.04   85.33   84.43   84.90   81.10   84.43   82.12   ] [ average youtube ]
[ youtube 1       ]   [  85.21   100.00  83.96   84.76   85.73   85.18   85.04   80.41   81.58   77.43   82.51   79.80   ] [ youtube 1       ]
[ youtube 2       ]   [  84.63   83.96   100.00  87.30   85.63   84.66   81.56   81.12   79.96   79.08   78.90   79.32   ] [ youtube 2       ]
[ youtube 3       ]   [  85.38   84.76   87.30   100.00  86.54   87.12   83.85   80.31   80.86   78.14   80.15   78.80   ] [ youtube 3       ]
[ youtube 4       ]   [  85.98   85.73   85.63   86.54   100.00  89.46   85.78   81.19   82.22   76.97   81.13   79.50   ] [ youtube 4       ]
[ youtube 5       ]   [  86.04   85.18   84.66   87.12   89.46   100.00  86.87   81.77   81.96   77.08   81.08   79.79   ] [ youtube 5       ]
[ youtube 6       ]   [  85.33   85.04   81.56   83.85   85.78   86.87   100.00  82.21   82.71   78.36   81.86   80.81   ] [ youtube 6       ]
[ youtube 7       ]   [  84.43   80.41   81.12   80.31   81.19   81.77   82.21   100.00  85.97   82.26   84.98   85.98   ] [ youtube 7       ]
[ youtube 8       ]   [  84.90   81.58   79.96   80.86   82.22   81.96   82.71   85.97   100.00  81.99   87.14   84.61   ] [ youtube 8       ]
[ youtube 9       ]   [  81.10   77.43   79.08   78.14   76.97   77.08   78.36   82.26   81.99   100.00  82.12   82.82   ] [ youtube 9       ]
[ youtube 10      ]   [  84.43   82.51   78.90   80.15   81.13   81.08   81.86   84.98   87.14   82.12   100.00  86.58   ] [ youtube 10      ]
[ youtube 11      ]   [  82.12   79.80   79.32   78.80   79.50   79.79   80.81   85.98   84.61   82.82   86.58   100.00  ] [ youtube 11      ]
OK. That is kind of cool. Though the matrices will presumably line-wrap if your screen is too small. Take-home message: each website stays greater than 75% similar to itself across the 11-day period. Which I guess means we don't even need to average over 10 days! Presumably the average will give better results though.