Saturday 30 May 2015

announcing: command line guess file name tool.

OK. Just a quick one. A command line tool that, given a string, spits out the file names that most closely match it (using unscaled simm). Here is the code, and once again, I have made it open source.
$ ./guess.py

Usage: ./guess.py string [directory]

If the directory is not given, then use the current directory.

And a quick example:
$ ./guess.py wiki work-on-wikipedia

string to match:      wiki
directory to search:  work-on-wikipedia
====================================

50 %     wiki.xml
23.08 %  1000-wiki.xml
23.08 %  5000-wiki.xml
22.22 %  wikipedia-links.sw
13.64 %  play_with_wikipedia.py
12.5 %   trial-wikipedia-links.sw
12 %     fragment_wikipedia_xml.py
10.34 %  fast-write-wikipedia-links.sw
8.82 %   play_with_wikipedia__fast_write.py
So, that should be clear enough. Whether anything like this is already out there, I don't know. I presume so. eg, the edit-distance algorithm used in spell-checkers could be used in a similar way to my simm.
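For the curious, here is a minimal sketch of how an unscaled simm over byte frequencies can rank file names. This is my reconstruction of the idea, not the actual guess.py code, so the exact percentages may differ from the table above:

```python
from collections import Counter
import os

def unscaled_simm(s1, s2):
    """Unscaled simm on byte frequencies: the sum of the minimum counts,
    divided by the larger of the two totals. Case insensitive."""
    f, g = Counter(s1.lower()), Counter(s2.lower())
    overlap = sum(min(f[c], g[c]) for c in f)
    return overlap / max(sum(f.values()), sum(g.values()))

def guess(pattern, directory="."):
    """Rank the file names in a directory by similarity to the pattern."""
    scores = sorted(((unscaled_simm(pattern, name), name)
                     for name in os.listdir(directory)), reverse=True)
    for score, name in scores:
        print("%.2f %%  %s" % (100 * score, name))
```

eg, with this mapping "wiki" vs "wiki.xml" gives 50%, since all four bytes of "wiki" are found in the eight bytes of "wiki.xml".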

Heh. Perhaps I gave a bad example! In the above case, a simple:
$ ls -1 work-on-wikipedia/ | grep wiki
1000-wiki.xml
5000-wiki.xml
fast-write-wikipedia-links.sw
fragment_wikipedia_xml.py
play_with_wikipedia.py
play_with_wikipedia__fast_write.py
trial-wikipedia-links.sw
wiki.xml
wikipedia-links.sw
would have given similar results!

So here is another example of guess:
$ ./guess.py "sam fred" sw-examples/ | head

string to match:      sam fred
directory to search:  sw-examples
====================================

30.77 %  small-fred.sw
26.67 %  Freds-family.sw
26.32 %  fred-sam-friends.sw
22.22 %  shares.sw
16.67 %  saved-commands.txt

$ ./guess.py "fred sam" sw-examples/ | head

string to match:      fred sam
directory to search:  sw-examples
====================================

33.33 %  Freds-family.sw
31.58 %  fred-sam-friends.sw
25 %     frog.sw
23.08 %  small-fred.sw
20 %     friends.sw
So some things to note are:
1) "fred sam" and "sam fred" give essentially the same answer
2) I made it case insensitive
3) we don't need to use any regexp, which we would have to do if using grep in a similar manner.
4) we could possibly use this to help find the exact ket when we only partly know its name, and the sw file is too large to search manually.

Anyway, should be useful here and there.

Tuesday 19 May 2015

comment on ket independence

Just a quick comment on the idea of ket independence.

We can say |x_i> for i in {1,2,..,n} are independent with respect to the op-sequence "op4 op3 op2 op1" if:
op4 op3 op2 op1 (|x_1> + |x_2> + ... + |x_n>)
  = op4 op3 op2 op1 |x_1>
  + op4 op3 op2 op1 |x_2>
  ...
  + op4 op3 op2 op1 |x_n>

Not yet sure where this will be useful, but I figure I should mention it.
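As a toy illustration (my own sketch, not actual BKO code), model superpositions as Counters and operators as functions. A nonlinear operator, such as a hypothetical "clean" that sets every coefficient to 1, leaves kets with disjoint support independent, while kets with overlapping support need not be:

```python
from collections import Counter

def add(*sps):
    # superposition addition: add coefficients ket-wise
    total = Counter()
    for sp in sps:
        total.update(sp)
    return total

def clean(sp):
    # a nonlinear op: set every coefficient to 1 (a stand-in example)
    return Counter({k: 1 for k in sp})

def independent(op_sequence, kets):
    # apply the op-sequence to the sum, and to each ket separately,
    # then check the two results agree
    def apply_seq(sp):
        for op in reversed(op_sequence):   # op1 acts first
            sp = op(sp)
        return sp
    return apply_seq(add(*kets)) == add(*[apply_seq(x) for x in kets])
```

eg, |x_1> = |a> and |x_3> = |c> are independent with respect to "clean", but |x_1> = |a> and |x_2> = |a> + |b> are not, since clean (|a> + (|a> + |b>)) drops the doubled-up |a> coefficient.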

Tuesday 12 May 2015

average-categorize websites

Now for a more interesting example. Recall work on pattern recognition of websites. In that case, we constructed the averages manually. We manually took the average of abc {1,2,...,10}, adelaidenow {1,2,...,10} etc. Then applied simm to webpage-11. This time, we get average-categorize to find our sample classes (again, webpage-11's are filtered out). Here is the BKO:
sa: load improved-fragment-webpages.sw
sa: load label-training-data-for-website-fragments.sw
sa: average-categorize[training-hash-4B,0.7,phi,average-cat]
sa: load create-cat-ave-pat-rec-matrix.sw
Here is the result:
sa: matrix[result]
[ phi: 1 ] = [  91.69  28.73  25.76  37.77  29.45  24.33  ] [ abc 11         ]
[ phi: 2 ]   [  28.77  78.04  26.72  29.86  25.25  28.19  ] [ adelaidenow 11 ]
[ phi: 3 ]   [  25.76  26.88  79.05  28.28  26.86  23.21  ] [ slashdot 11    ]
[ phi: 4 ]   [  37.8   29.75  28.16  85.54  32.06  24.95  ] [ smh 11         ]
[ phi: 5 ]   [  29.71  25.25  26.91  31.87  85.17  22.09  ] [ wikipedia 11   ]
[ phi: 6 ]   [  24.32  28.18  23.47  24.92  21.94  82.02  ] [ youtube 11     ]
Which counts as a success! This matrix is essentially identical to the one we constructed manually. Except this time, we didn't specify the classes; average-categorize did that for us, resulting in phi {1,2,3,4,5,6}.

So, a nice example of unlabeled learning. Though admittedly, webpage superpositions are rather well behaved, certainly better behaved than those in the adult wage prediction example, in the sense that different classes have distinct superpositions. The wage prediction example completely failed that test.

Now that we have distinct classes we can give them names. Pretty simple actually:
M |phi: 1> => |abc>
M |phi: 2> => |adelaidenow>
M |phi: 3> => |slashdot>
M |phi: 4> => |smh>
M |phi: 5> => |wikipedia>
M |phi: 6> => |youtube>
Now, take a look:
sa: matrix[M,result]
[ abc         ] = [  1  0  0  0  0  0  ] [  91.69  28.73  25.76  37.77  29.45  24.33  ] [ abc 11         ]
[ adelaidenow ]   [  0  1  0  0  0  0  ] [  28.77  78.04  26.72  29.86  25.25  28.19  ] [ adelaidenow 11 ]
[ slashdot    ]   [  0  0  1  0  0  0  ] [  25.76  26.88  79.05  28.28  26.86  23.21  ] [ slashdot 11    ]
[ smh         ]   [  0  0  0  1  0  0  ] [  37.8   29.75  28.16  85.54  32.06  24.95  ] [ smh 11         ]
[ wikipedia   ]   [  0  0  0  0  1  0  ] [  29.71  25.25  26.91  31.87  85.17  22.09  ] [ wikipedia 11   ]
[ youtube     ]   [  0  0  0  0  0  1  ] [  24.32  28.18  23.47  24.92  21.94  82.02  ] [ youtube 11     ]
Or, as a merged matrix:
sa: merged-matrix[M,result]
[ abc         ] = [  91.69  28.73  25.76  37.77  29.45  24.33  ] [ abc 11         ]
[ adelaidenow ]   [  28.77  78.04  26.72  29.86  25.25  28.19  ] [ adelaidenow 11 ]
[ slashdot    ]   [  25.76  26.88  79.05  28.28  26.86  23.21  ] [ slashdot 11    ]
[ smh         ]   [  37.8   29.75  28.16  85.54  32.06  24.95  ] [ smh 11         ]
[ wikipedia   ]   [  29.71  25.25  26.91  31.87  85.17  22.09  ] [ wikipedia 11   ]
[ youtube     ]   [  24.32  28.18  23.47  24.92  21.94  82.02  ] [ youtube 11     ]
Presumably humans/children do something similar. They build up and recognize classes of objects before they even have a name for them. Only later, after the classes are well specified, do they discover a label (in the above case, the M learn rules).

average-categorize H-I simple pattern recognition example

Last post I outlined and gave the code for average-categorize. This post gives a simple example. Recall the simple pattern recognition example, H-I-pat-rec.sw. Let's apply average-categorize to it.

Now, we have these patterns:
x: letter: H
1       1
1       1
1       1
1 1 1 1 1
1       1
1       1
1       1

x: noisy: H
        1
1       1
1       1
1 1 1   1
1
1       1
1       1

x: noisy: H2
1       1
1
1   1 1 1
1 1 1 1 1
1 1     1
1       1
1 1 1   1

x: letter: I
1 1 1 1 1
    1
    1
    1
    1
    1
1 1 1 1 1

x: noisy: I
1 1 1 1
    1


    1
    1
1   1 1 1

x: noisy: I2
1 1     1
  1 1 1
    1
    1
    1 1 1
1 1 1 1
1 1 1 1 1

x: letter: O
1 1 1 1 1
1
1
1
1
1
1 1 1 1 1
Run this code:
average_categorize(C,"pixels",0.65,"phi","pixels")
And now we have:
x: phi: 1
1       2
2       1
2   0 0 2
2 2 2 1 2
2 0     1
2       2
2 0 0   2

x: phi: 2
1 1 1 1 1
    1
    1
    1
    1
    1
1 1 1 1 1

x: phi: 3
1 1     1
  1 1 1
    1
    1
    1 1 1
1 1 1 1
1 1 1 1 1

x: phi: 4
1 1 1 1 1
1
1
1
1
1
1 1 1 1 1
I think that works pretty well!

Webpage superposition example in the next post.

new function: average-categorize

This one I think will be super useful, and super powerful. I only just came up with it, so I haven't fully explored its properties, but I have already found some. And it's looking promising!

So the idea is, given a collection of points, how do we classify them into groups? I'm pretty sure this is a solved problem, but I wanted to see what I could come up with.

It goes something like this (though see the code for exact details). We have a list of currently known patterns (out_list) and a list of incoming patterns (one). For each incoming pattern (r), compare it with each of the known patterns (sp) using simm, and store the index and similarity value of the best match. At the end of the loop, if the best similarity is below a threshold (recall that for simm, a higher value is a better match, exactly 1 for an exact match), store r as its own known pattern. Otherwise, "average" it with the best-matching known pattern, weighted by its similarity. And in BKO land, simm scales/normalizes superpositions, so to average we just add the superpositions; we don't need to divide by n.

Here is the code:
def average_categorize(context,parameters):
  try:
    op,t,phi,ave = parameters.split(',')
    t = float(t)
  except:
    return ket("",0)

  one = context.relevant_kets(op)      # the kets that have "op" defined
  out_list = []                        # the known patterns found so far
  for x in one:
    r = x.apply_op(context,op)         # the incoming pattern
    best_k = -1
    best_simm = 0
    for k,sp in enumerate(out_list):   # find the best-matching known pattern
      similarity = silent_simm(r,sp)
      if similarity > best_simm:
        best_k = k
        best_simm = similarity
    if best_k == -1 or best_simm < t:  # no good enough match, so start a new class
      out_list.append(r)
    else:
      k = best_k
#      out_list[k] += r
      out_list[k] += r.multiply(best_simm)  # reweight based on result of simm.
  for k,sp in enumerate(out_list):     # store the classes as phi: 1, phi: 2, ...
    context.learn(ave,phi + ": " + str(k+1),sp)
  return ket("average categorize")
Now, I may tweak this code yet, but it is already a good start. I will give a couple of examples in future posts.

Some properties:
1) it converges as you add more examples
2) the order you learn is important, earlier examples have a bigger effect than later ones
3) it reproduces the light-bulb effect, ie stronger examples have a bigger effect
4) it reproduces the reverse-light-bulb effect, ie, weaker examples have essentially no impact
5) I think it is roughly O(n^2), certainly cheaper than my previous categorize code.

I guess I need to explain what I mean by "stronger examples" and "weaker examples". Superpositions have a thing I call currency. Let's drop to the console:
sa: measure-currency |x>
|number: 1>

sa: measure-currency 2.71|y>
|number: 2.71>

sa: measure-currency (|a> + |b> + |c>)
|number: 3>

sa: measure-currency (0.2|a> + 4|b> + 11|c> + 0.3|d>)
|number: 15.5>
So when I talk of a "stronger example" I mean r has a high currency, and by a "weaker example" I mean r has a small currency.
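Modelling a superposition as a plain dict from ket labels to coefficients, measure-currency is just the coefficient sum. A minimal sketch:

```python
def measure_currency(sp):
    # currency of a superposition: the sum of its coefficients
    return sum(sp.values())
```

eg, measure_currency({'a': 0.2, 'b': 4, 'c': 11, 'd': 0.3}) gives 15.5, matching the console example above.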

If we look at the line:
      out_list[k] += r.multiply(best_simm)
then (ignoring the simm reweighting), the higher the currency of r, the more power it has to affect out_list[k]. I'm not sure that is clear! I'm working from mental images, and it is hard to translate them into words.

More to come!

BTW, I don't think I have mentioned this anywhere yet, but we can say an operator "op" preserves currency if:
measure-currency SP == measure-currency op SP
A related definition (and probably a slightly better one) for currency conservation is that if in matrix format the sum of each column is exactly 1, then we can say that operator conserves currency.
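The column-sum test is easy to sketch, assuming the operator is given as a list of rows (rows index output kets, columns index input kets):

```python
def conserves_currency(matrix, tol=1e-9):
    # an operator conserves currency if every column sums to exactly 1
    cols = range(len(matrix[0]))
    return all(abs(sum(row[j] for row in matrix) - 1) < tol for j in cols)
```

eg, [[0.5, 0], [0.5, 1]] conserves currency (both columns sum to 1), while [[1, 0], [1, 1]] does not.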

Thursday 7 May 2015

an interesting little observation

OK. Just a quick one. There seems to be an interesting analogy between the different layers in the BKO scheme.

Take the BKO:
|y> => op4 op3 op2 op1 "" |x>
which simply corresponds to stepping from superposition to superposition, then storing the result in |y>.
Well, a level above that, we have something similar with .sw files, noting that just loading an sw file can do computation (the only time it doesn't is when all the rules are literal superpositions).

So the analogy of the above, at the sw level is:
load x.sw
load op1.sw
load op2.sw
load op3.sw
load op4.sw
save y.sw

Anyway, I just thought it was an interesting little observation.

announcing: command line file similarity tool

Decided to write a script that spits out the similarity of two files, using my simm. It has three mappings from file to superposition/list (byte, 2-byte and fragment), and gives both scaled and unscaled simm results. Here is the code, and I decided to make it open source!

I guess some examples:
-- define some toy files:
$ echo -n "aa" > a.txt
$ echo -n "a" > b.txt

-- find their similarity:
$ ./simm.py a.txt b.txt
file 1:                a.txt
file 2:                b.txt

unscaled:
  byte similarity:     50 %
  2 byte similarity:   50 %

scaled:
  byte similarity:     100 %
  2 byte similarity:   50 %
Next example:
-- define toy files:
$ echo -n "fish" > a.txt
$ echo -n "fsih" > b.txt

-- find similarity:
$ ./simm.py a.txt b.txt
file 1:                a.txt
file 2:                b.txt

unscaled:
  byte similarity:     100 %
  2 byte similarity:   25 %

scaled:
  byte similarity:     100 %
  2 byte similarity:   25 %
A slightly more interesting example, and this time we also use fragment simm:
-- define a couple of sentences:
$ echo "Shakespeare produced most of his known work between 1589 and 1613" > a.txt
$ echo "Shakespeare produced most of his work after 1589" > b.txt

-- find their similarity, and fragment simm on spaces:
$ ./simm.py a.txt b.txt ' '
file 1:                a.txt
file 2:                b.txt

split strings:         ' '

unscaled:
  byte similarity:     71.21 %
  2 byte similarity:   65.15 %
  fragment similarity: 63.64 %

scaled:
  byte similarity:     82.19 %
  2 byte similarity:   66.2 %
  fragment similarity: 63.64 %
Now, on larger files. Recall the simm matrix I made here? Well, let's test a couple of examples:
$ ./simm.py example-files/binary.exe example-files/ebook.txt
file 1:                example-files/binary.exe
file 2:                example-files/ebook.txt

unscaled:
  byte similarity:     9.25 %
  2 byte similarity:   1.92 %

scaled:
  byte similarity:     12.44 %
  2 byte similarity:   2.66 %

$ ./simm.py example-files/ebook.txt example-files/website.html
file 1:                example-files/ebook.txt
file 2:                example-files/website.html

unscaled:
  byte similarity:     24.07 %
  2 byte similarity:   17.82 %

scaled:
  byte similarity:     72.7 %
  2 byte similarity:   47.26 %

$ ./simm.py example-files/encrypted.gpg example-files/zipped.zip
file 1:                example-files/encrypted.gpg
file 2:                example-files/zipped.zip

unscaled:
  byte similarity:     94.06 %
  2 byte similarity:   63.75 %

scaled:
  byte similarity:     97.6 %
  2 byte similarity:   65.2 %
Now, a more interesting fragment simm example. Recall the simm matrices I made here. Well, let's give some examples:
$ ./simm.py webpages-v2/abc-1.html webpages-v2/abc-7.html '<' '>'
file 1:                webpages-v2/abc-1.html
file 2:                webpages-v2/abc-7.html

split strings:         '<' '>'

unscaled:
  byte similarity:     98.28 %
  2 byte similarity:   95.28 %
  fragment similarity: 91.88 %

scaled:
  byte similarity:     98.76 %
  2 byte similarity:   95.65 %
  fragment similarity: 91.88 %

$ ./simm.py webpages-v2/abc-1.html webpages-v2/adelaidenow-1.html '<' '>'
file 1:                webpages-v2/abc-1.html
file 2:                webpages-v2/adelaidenow-1.html

split strings:         '<' '>'

unscaled:
  byte similarity:     11.41 %
  2 byte similarity:   10.63 %
  fragment similarity: 7.4 %

scaled:
  byte similarity:     82.51 %
  2 byte similarity:   61.66 %
  fragment similarity: 28.76 %

$ ./simm.py webpages-v2/abc-1.html webpages-v2/youtube-1.html '<' '>'
file 1:                webpages-v2/abc-1.html
file 2:                webpages-v2/youtube-1.html

split strings:         '<' '>'

unscaled:
  byte similarity:     13.23 %
  2 byte similarity:   10.87 %
  fragment similarity: 10.29 %

scaled:
  byte similarity:     76.47 %
  2 byte similarity:   49.16 %
  fragment similarity: 24.27 %
And finally, some text files:
$ ./simm.py text/ebook-Alices_Adventures_in_Wonderland_11.txt text/ebook-Tom_Sawyer_74.txt ' '
file 1:                text/ebook-Alices_Adventures_in_Wonderland_11.txt
file 2:                text/ebook-Tom_Sawyer_74.txt

split strings:         ' '

unscaled:
  byte similarity:     40.02 %
  2 byte similarity:   38.49 %
  fragment similarity: 28.69 %

scaled:
  byte similarity:     95.47 %
  2 byte similarity:   86.92 %
  fragment similarity: 54.34 %

$ ./simm.py text/ebook-Alices_Adventures_in_Wonderland_11.txt text/ebook-moby-shakespeare.txt ' '
file 1:                text/ebook-Alices_Adventures_in_Wonderland_11.txt
file 2:                text/ebook-moby-shakespeare.txt

split strings:         ' '

unscaled:
  byte similarity:     3.13 %
  2 byte similarity:   3.11 %
  fragment similarity: 2.5 %

scaled:
  byte similarity:     90.8 %
  2 byte similarity:   79.5 %
  fragment similarity: 41.8 %

$ ./simm.py text/WP-Adelaide.txt text/WP-Australia.txt ' '
file 1:                text/WP-Adelaide.txt
file 2:                text/WP-Australia.txt

split strings:         ' '

unscaled:
  byte similarity:     88.89 %
  2 byte similarity:   81.13 %
  fragment similarity: 46.52 %

scaled:
  byte similarity:     95.47 %
  2 byte similarity:   86.49 %
  fragment similarity: 48.83 %

$ ./simm.py text/WP-Adelaide.txt text/WP-physics.txt ' '
file 1:                text/WP-Adelaide.txt
file 2:                text/WP-physics.txt

split strings:         ' '

unscaled:
  byte similarity:     54.64 %
  2 byte similarity:   50.26 %
  fragment similarity: 23.42 %

scaled:
  byte similarity:     90.3 %
  2 byte similarity:   77.31 %
  fragment similarity: 35.54 %
And one final example. This is what happens if you try the wrong split strings for a given document type. eg here less-than and greater-than on a text file:
$ ./simm.py text/WP-Adelaide.txt text/WP-Australia.txt '<' '>'
file 1:                text/WP-Adelaide.txt
file 2:                text/WP-Australia.txt

split strings:         '<' '>'

unscaled:
  byte similarity:     88.89 %
  2 byte similarity:   81.13 %
  fragment similarity: 0 %

scaled:
  byte similarity:     95.47 %
  2 byte similarity:   86.49 %
  fragment similarity: 0 %
Note the 0% fragment similarity.

That's it for this post. BTW, in looking around to see if others have written file similarity tools I found simhash. Heh, almost certainly faster than my method!

BTW, the above should work just fine on a large range of document types. You just need to choose the right split strings. eg, for C we might use:
./simm.py code-1.c code-2.c ' ' '{' '}' '(' ')' ';'
ie, space, curly-brackets, normal brackets and semi-colon.

Indeed, we could also apply it to DNA (though presumably real researchers already have tools for this!). So, we can specify split-strings and apply it to text files of DNA sequences.
eg:
./simm.py DNA-1.txt DNA-2.txt 'GAAATTCCCA' 'ATACCACT' 'AACCACACAC' 'TTAGGG'
That we can do this should not be a surprise, since my fragment similarity idea was originally motivated by the idea of gel electrophoresis.

OK. Let's do a worked example!
Let's first choose 'TTAGGG' as our split string.
Let's choose these to be our 2 DNA sequences:
GAAATTCCCA TTAGGG ATACCACT
AACCACACAC TTAGGG GAAATTCCCA
And by construction, we expect 50% similarity.
Let's take a look:
$ echo -n "GAAATTCCCATTAGGGATACCACT" > a.txt
$ echo -n "AACCACACACTTAGGGGAAATTCCCA" > b.txt

$ ./simm.py a.txt b.txt 'TTAGGG'
file 1:                a.txt
file 2:                b.txt

split strings:         'TTAGGG'

unscaled:
  byte similarity:     84.62 %
  2 byte similarity:   73.08 %
  fragment similarity: 50 %

scaled:
  byte similarity:     89.1 %
  2 byte similarity:   75.64 %
  fragment similarity: 50 %
Success!
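The fragment similarity idea can be sketched as follows: split both files on the split strings, then apply unscaled simm to the fragment counts. This is my reconstruction of the idea, so simm.py itself may differ in details:

```python
from collections import Counter
import re

def fragment_simm(s1, s2, split_strings):
    # split both strings on any of the split strings
    pattern = "|".join(re.escape(s) for s in split_strings)
    f = Counter(frag for frag in re.split(pattern, s1) if frag)
    g = Counter(frag for frag in re.split(pattern, s2) if frag)
    # unscaled simm on the fragment counts
    overlap = sum(min(f[x], g[x]) for x in f)
    return overlap / max(sum(f.values()), sum(g.values()))
```

With the DNA example above, splitting on 'TTAGGG' gives fragments {GAAATTCCCA, ATACCACT} and {AACCACACAC, GAAATTCCCA}, with one fragment in common out of two, ie 50%.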

Finally, my code is somewhat slow! A C implementation would be nice.

Update: if your files are line based, '\n' (ie, new-line) should usually be a sufficient split string.
If you have csv files say, then maybe '\n' and ',' (ie, new-line and comma).

Update: talking of speeding it up, if we use lists of bits, we can speed it up somewhat.
wf = sum-of-bits(f)
wg = sum-of-bits(g)
wfg = sum-of-bits(xor(f,g))
result = (wf + wg - wfg)/(2*max(wf,wg))
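The bit-list version is easy to sanity-check against the direct min-sum form (note the denominator is 2*max(wf, wg), since (wf + wg - wfg)/2 is the number of shared bits):

```python
def bit_simm(f, g):
    # f, g: equal-length lists of 0/1 bits
    wf = sum(f)
    wg = sum(g)
    wfg = sum(x ^ y for x, y in zip(f, g))   # sum-of-bits of xor(f, g)
    return (wf + wg - wfg) / (2 * max(wf, wg))

def direct_simm(f, g):
    # the direct min-sum form of unscaled simm, for comparison
    return sum(min(x, y) for x, y in zip(f, g)) / max(sum(f), sum(g))
```

eg, for f = [1,1,0,1,0] and g = [1,0,1,1,0] both forms give 2/3.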

Update: OK. I've given this "speed up" some thought, and decided it will not be as good as the full simm. For example, consider C code. With the bit method, a file containing 1 "int" is "identical" to another containing say 200 of them. Which is not all that useful!