Wednesday 25 November 2015

revisiting wikipedia inverse-links-to semantic similarities

Nothing new here, just some more wikipedia semantic similarity examples. The motivation was partly word2vec, which ships some examples of semantic similarity using its word vectors. I had planned to write my own word2sp, but so far my idea has failed! And I couldn't use their word vectors directly, because they have negative coeffs, while my similarity metric requires positive coeffs.

So in the meantime I decided to re-run my wikipedia code. I tried to use the 300,000 wikipedia links sw file, but that failed too: it needed too much RAM and took too long to run. I thought I had used it in the past, in which case I don't know why it failed this time!

Here is the first word2vec example (distance to "france"):
                 Word       Cosine distance
-------------------------------------------
                spain              0.678515
              belgium              0.665923
          netherlands              0.652428
                italy              0.633130
          switzerland              0.622323
           luxembourg              0.610033
             portugal              0.577154
               russia              0.571507
              germany              0.563291
            catalonia              0.534176
Here it is using my code:
sa: load 30k--wikipedia-links.sw
sa: find-inverse[links-to]
sa: T |*> #=> table[page,coeff] select[1,200] 100 self-similar[inverse-links-to] |_self>
sa: T |WP: France>
+--------------------------------+--------+
| page                           | coeff  |
+--------------------------------+--------+
| France                         | 100.0  |
| Germany                        | 31.771 |
| United_Kingdom                 | 30.537 |
| Italy                          | 27.452 |
| Spain                          | 23.566 |
| United_States                  | 20.152 |
| Japan                          | 19.556 |
| Netherlands                    | 19.309 |
| Russia                         | 18.877 |
| Canada                         | 18.384 |
| Europe                         | 17.273 |
| India                          | 17.212 |
| China                          | 16.78  |
| Paris                          | 16.595 |
| England                        | 16.286 |
| World_War_II                   | 15.923 |
| Australia                      | 15.238 |
| Soviet_Union                   | 14.867 |
| Belgium                        | 14.189 |
| Poland                         | 14.127 |
| Portugal                       | 13.819 |
| World_War_I                    | 13.757 |
| Austria                        | 13.695 |
| Sweden                         | 13.572 |
| Switzerland                    | 13.51  |
| Egypt                          | 12.647 |
| European_Union                 | 12.4   |
| Brazil                         | 12.338 |
| United_Nations                 | 12.091 |
| Greece                         | 11.906 |
| London                         | 11.906 |
| Israel                         | 11.783 |
| Turkey                         | 11.783 |
| Denmark                        | 11.598 |
| French_language                | 11.536 |
| Norway                         | 11.413 |
| Latin                          | 10.611 |
| Rome                           | 10.364 |
| Mexico                         | 10.364 |
| English_language               | 9.994  |
| South_Africa                   | 9.747  |
...
which works pretty well I must say.
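For anyone not fluent in the notation, here is a rough Python sketch of what those three lines do, assuming a links_to dict that maps each page to the set of pages it links to. The names are mine, and the set-overlap similarity is only an approximation of the simm metric behind self-similar:
from collections import defaultdict

def find_inverse(links_to):
    # build inverse-links-to: for each page, the set of pages that link to it
    inverse = defaultdict(set)
    for page, targets in links_to.items():
        for target in targets:
            inverse[target].add(page)
    return inverse

def similarity(a, b):
    # overlap of two inverse-link sets, scaled to 0..100
    if not a or not b:
        return 0.0
    return 100.0 * len(a & b) / max(len(a), len(b))

def self_similar(page, inverse, top=40):
    # rank every page by how similar its inverse-link set is to that of the given page
    target = inverse[page]
    scores = [(other, similarity(target, inverse[other])) for other in inverse]
    return sorted(scores, key=lambda x: -x[1])[:top]

# toy usage:
links_to = {
    "France": {"Paris", "Germany", "Europe"},
    "Germany": {"Berlin", "France", "Europe"},
    "Paris": {"France", "Europe"},
}
inverse = find_inverse(links_to)
print(self_similar("France", inverse))
This only gives the flavour of the computation; the real simm metric handles weighted superpositions rather than plain sets.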

Here is the next word2vec example (distance to San Francisco):
                 Word       Cosine distance
-------------------------------------------
          los_angeles              0.666175
          golden_gate              0.571522
              oakland              0.557521
           california              0.554623
            san_diego              0.534939
             pasadena              0.519115
              seattle              0.512098
                taiko              0.507570
              houston              0.499762
     chicago_illinois              0.491598
Here it is using my code:
+---------------------------------------+--------+
| page                                  | coeff  |
+---------------------------------------+--------+
| San_Francisco                         | 100.0  |
| Los_Angeles                           | 16.704 |
| Chicago                               | 15.919 |
| 1924                                  | 15.522 |
| 1916                                  | 14.566 |
| California                            | 14.502 |
| 1915                                  | 14.286 |
| 2014                                  | 14.217 |
| 1933                                  | 14.031 |
| 1913                                  | 14.006 |
| 1918                                  | 14.0   |
| 1930                                  | 14.0   |
| Philadelphia                          | 13.99  |
| 1925                                  | 13.984 |
| 1931                                  | 13.904 |
| 1920                                  | 13.802 |
| 1932                                  | 13.776 |
| 1942                                  | 13.744 |
| 1999                                  | 13.725 |
...
Hrmm... that didn't work so great. I wonder why.

Here is the next word2vec example:
Enter word or sentence (EXIT to break): /en/geoffrey_hinton

                        Word       Cosine distance
--------------------------------------------------
           /en/marvin_minsky              0.457204
             /en/paul_corkum              0.443342
 /en/william_richard_peltier              0.432396
           /en/brenda_milner              0.430886
    /en/john_charles_polanyi              0.419538
          /en/leslie_valiant              0.416399
         /en/hava_siegelmann              0.411895
            /en/hans_moravec              0.406726
         /en/david_rumelhart              0.405275
             /en/godel_prize              0.405176
And here it is using my code:
+------------------------------------------------------------+--------+
| page                                                       | coeff  |
+------------------------------------------------------------+--------+
| Geoffrey_Hinton                                            | 100    |
| perceptron                                                 | 66.667 |
| Tom_M._Mitchell                                            | 66.667 |
| computational_learning_theory                              | 66.667 |
| Nils_Nilsson_(researcher)                                  | 66.667 |
| beam_search                                                | 66.667 |
| Raj_Reddy                                                  | 50     |
| AI_effect                                                  | 40     |
| ant_colony_optimization                                    | 40     |
| List_of_artificial_intelligence_projects                   | 33.333 |
| AI-complete                                                | 33.333 |
| Cyc                                                        | 33.333 |
| Hugo_de_Garis                                              | 33.333 |
| Joyce_K._Reynolds                                          | 33.333 |
| Kleene_closure                                             | 33.333 |
| Mondegreen                                                 | 33.333 |
| Supervised_learning                                        | 33.333 |
...
And now a couple more examples:
sa: T |WP: Linux>
+------------------------------------------------+--------+
| page                                           | coeff  |
+------------------------------------------------+--------+
| Linux                                          | 100.0  |
| Microsoft_Windows                              | 46.629 |
| operating_system                               | 37.333 |
| Unix                                           | 28.956 |
| Mac_OS_X                                       | 26.936 |
| C_(programming_language)                       | 24.242 |
| Microsoft                                      | 22.535 |
| GNU_General_Public_License                     | 22.222 |
| Mac_OS                                         | 19.529 |
| Unix-like                                      | 19.192 |
| IBM                                            | 19.048 |
| open_source                                    | 17.845 |
| FreeBSD                                        | 17.845 |
| Apple_Inc.                                     | 16.498 |
| Java_(programming_language)                    | 15.825 |
| OS_X                                           | 15.488 |
| free_software                                  | 15.488 |
| Sun_Microsystems                               | 15.152 |
| C++                                            | 15.152 |
| source_code                                    | 15.152 |
| Macintosh                                      | 14.815 |
| MS-DOS                                         | 13.468 |
| Solaris_(operating_system)                     | 13.468 |
| PowerPC                                        | 13.131 |
| DOS                                            | 13.131 |
| Android_(operating_system)                     | 13.131 |
| Windows_NT                                     | 12.795 |
| Intel                                          | 12.458 |
| programming_language                           | 12.121 |
| personal_computer                              | 12.121 |
| OpenBSD                                        | 11.785 |
| Unicode                                        | 11.111 |
| graphical_user_interface                       | 10.774 |
| video_game                                     | 10.774 |
| Cross-platform                                 | 10.774 |
| Internet                                       | 10.574 |
| OS/2                                           | 10.438 |
...

sa: T |WP: Ronald_Reagan>
+---------------------------------------------------------+--------+
| page                                                    | coeff  |
+---------------------------------------------------------+--------+
| Ronald_Reagan                                           | 100.0  |
| John_F._Kennedy                                         | 22.951 |
| Bill_Clinton                                            | 22.404 |
| Barack_Obama                                            | 22.283 |
| George_H._W._Bush                                       | 22.131 |
| Jimmy_Carter                                            | 22.131 |
| Richard_Nixon                                           | 22.131 |
| George_W._Bush                                          | 22.131 |
| Republican_Party_(United_States)                        | 21.785 |
| Democratic_Party_(United_States)                        | 20.779 |
| United_States_Senate                                    | 19.444 |
| President_of_the_United_States                          | 17.538 |
| White_House                                             | 15.574 |
| Franklin_D._Roosevelt                                   | 15.301 |
| Vietnam_War                                             | 15.242 |
| United_States_House_of_Representatives                  | 14.754 |
| United_States_Congress                                  | 14.085 |
| Supreme_Court_of_the_United_States                      | 13.388 |
| Lyndon_B._Johnson                                       | 13.388 |
| Margaret_Thatcher                                       | 13.115 |
| Cold_War                                                | 13.093 |
| Dwight_D._Eisenhower                                    | 12.568 |
| Nobel_Peace_Prize                                       | 12.368 |
| The_Washington_Post                                     | 12.295 |
| Gerald_Ford                                             | 12.022 |
...

sa: T |WP: Los_Angeles>
+--------------------------------------------------+--------+
| page                                             | coeff  |
+--------------------------------------------------+--------+
| Los_Angeles                                      | 100.0  |
| Chicago                                          | 20.852 |
| California                                       | 18.789 |
| Los_Angeles_Times                                | 17.833 |
| San_Francisco                                    | 16.704 |
| New_York_City                                    | 15.536 |
| Philadelphia                                     | 14.221 |
| NBC                                              | 12.641 |
| Washington,_D.C.                                 | 11.484 |
| Boston                                           | 11.061 |
| USA_Today                                        | 10.609 |
| Texas                                            | 10.384 |
| Academy_Award                                    | 10.158 |
| Seattle                                          | 9.932  |
| New_York                                         | 9.88   |
| Time_(magazine)                                  | 9.851  |
| Mexico_City                                      | 9.707  |
| The_New_York_Times                               | 9.685  |
| Rolling_Stone                                    | 9.481  |
| CBS                                              | 9.481  |
| Toronto                                          | 9.481  |
...
Now for a final couple of comments. First, my code is painfully slow. But to be fair, I'm a terrible programmer, and this is research code, not production code. The point I'm trying to make is that a real programmer could probably make the speed acceptable, and hence usable.

For the second point, let me quote from word2vec:
"It was recently shown that the word vectors capture many linguistic regularities, for example vector operations vector('Paris') - vector('France') + vector('Italy') results in a vector that is very close to vector('Rome'), and vector('king') - vector('man') + vector('woman') is close to vector('queen')"

I haven't tested it, but I'm pretty sure my inverse-links-to sparse vectors do not have this fun property. That is why I want to create my own word2sp code, though it would take a slightly different form in BKO. Something along the lines of:
vector |some object> => exclude(vector|France>,vector|Paris>) + vector|Italy>
table[object,coeff] select[1,20] 100 self-similar[vector] |some object>
should, if it works, return |Rome> as one of the top similarity matches.
Likewise:
vector |some object> => exclude(vector|man>,vector|king>) + vector|woman>
table[object,coeff] select[1,20] 100 self-similar[vector] |some object>
should return |queen> as a top hit.

Definitely something I would like to test, but I need working code first.
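For what it's worth, here is roughly what such a test could look like in plain Python, assuming we already had a vectors dict of positive-coefficient sparse vectors, one per word. The exclude, add and simm helpers below are my placeholder reading of the BKO above, not working word2sp code:
def exclude(a, b):
    # drop from b any coordinate that appears in a (my reading of exclude(a, b))
    return {k: v for k, v in b.items() if k not in a}

def add(a, b):
    # coordinate-wise sum of two sparse vectors
    out = dict(a)
    for k, v in b.items():
        out[k] = out.get(k, 0.0) + v
    return out

def simm(a, b):
    # similarity of two positive-coefficient sparse vectors, in the range 0..1
    if not a or not b:
        return 0.0
    overlap = sum(min(a.get(k, 0.0), b.get(k, 0.0)) for k in set(a) | set(b))
    return overlap / max(sum(a.values()), sum(b.values()))

def analogy(vectors, a, b, c, top=20):
    # exclude(vector a, vector b) + vector c, then rank everything by similarity
    target = add(exclude(vectors[a], vectors[b]), vectors[c])
    scores = [(word, 100.0 * simm(target, v)) for word, v in vectors.items()]
    return sorted(scores, key=lambda x: -x[1])[:top]

# the hope is that analogy(vectors, "France", "Paris", "Italy") puts "Rome" near the top,
# and analogy(vectors, "man", "king", "woman") puts "queen" near the top.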

Monday 23 November 2015

revisiting the letter rambler

After changing the back-end meaning of add_learn, from something that was really append-learn to a literal add_learn (I hope I didn't break anything in the process!), I decided to redo the letter rambler example. Now our ngrams have frequency information too. This has a big effect on the letter rambler, as I will shortly show.

First, a comparison of the resulting sw files:
- the append-learn version
- the add-learn (frequency) version
Note that both were generated by the same code.

A couple of lines to visually show the difference (note the coefficients in the second case):
next-2-letters | by> => |  > + | t> + | s> + | a> + | h> + | w> + | n> + | d> + | o> + | q> + | M> + |. > + | e> + | m> + | p> + | b> + | r> + | y> + | H> + | C> + | i> + | c> + |?"> + | f> + | S> + | l> + | F> + | A> + |, > + | u> + | k> + | U>
next-2-letters | by> => |  > + 138.0| t> + 19.0| s> + 42.0| a> + 25.0| h> + 12.0| w> + 7.0| n> + 4.0| d> + 8.0| o> + 2.0| q> + 9.0| M> + |. > + 6.0| e> + 16.0| m> + 7.0| p> + 3.0| b> + 4.0| r> + 6.0| y> + | H> + | C> + 3.0| i> + 7.0| c> + |?"> + 2.0| f> + | S> + | l> + | F> + 3.0| A> + |, > + | u> + | k> + | U>
And now the new letter rambler:
sa: load ngram-letter-pairs--sherlock-holmes--add-learn.sw
sa: letter-ramble |*> #=> merge-labels(|_self> + weighted-pick-elt next-2-letters extract-3-tail-chars |_self>)
sa: letter-ramble^200 |The>
|They wered. Now did not acces of a humable, as shall beeched to belong meet as en bars in part the seriend's ask Mr. As when you. I could no had steps bell, and some forty, Mrs. Readings as so weathe that his is long colour minute mone out this, of the streat axe-Consibly after, whom I claim above use of my my doctorter inquiries. For and pisoda angry, as fulling heave dog-wheer ther obviouse when, >
And we can compare with the old version, which had no frequency information, by dropping back from weighted-pick-elt to plain pick-elt (weighted-pick-elt takes coeffs into account, while pick-elt does not):
sa: old-letter-ramble |*> #=> merge-labels(|_self> + pick-elt next-2-letters extract-3-tail-chars |_self>)
sa: old-letter-ramble^200 |The>
|The ont by, dea wed upset book, anxious him. Totts vanies tangibly ignotum tea cartyrdom Hosmoppine-margins," reporary, curson. Young criminas, apables Mr. Just you:  "Holbornine Aventrally, wore Sand. Royal brary. Warshy mergesty. John Ha! The dug an End on samps onlike a wound--1000 pound ord Suburly or 'G' wish us. Yes?" a "P," Holmask.  "Star' had. I owe, Winchisels whipping-schan legs. Having, I>
So the result is similar, just a bit less English-like.
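As a rough Python sketch of the difference, using a made-up stand-in for the Sherlock Holmes ngram data (the table below is not the real sw file, just enough to show the mechanics):
import random

# next_2_letters maps a 3-character tail to the next-2-character strings seen after it,
# together with their frequencies, as in the "next-2-letters | by> => ..." rules above
next_2_letters = {
    "The": {"y ": 25, "re": 10, " q": 3},
    "hey": {" w": 12, " s": 4, "'d": 1},
    # ... one entry for every 3-character tail seen in the training text
}

def weighted_pick_elt(freqs):
    # pick a continuation with probability proportional to its frequency
    options, weights = zip(*freqs.items())
    return random.choices(options, weights=weights)[0]

def pick_elt(freqs):
    # pick a continuation uniformly, ignoring frequencies (the old behaviour)
    return random.choice(list(freqs))

def letter_ramble(seed, steps=200, pick=weighted_pick_elt):
    text = seed
    for _ in range(steps):
        tail = text[-3:]                      # extract-3-tail-chars
        if tail not in next_2_letters:
            break
        text += pick(next_2_letters[tail])    # append the chosen two characters
    return text

print(letter_ramble("The"))                   # frequency-weighted ramble
print(letter_ramble("The", pick=pick_elt))    # the old, unweighted ramble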

That's all I wanted to show for this post.

Tuesday 10 November 2015

an interpretation of my BKO scheme

I think it is long past time to try and give an interpretation of my BKO scheme. I guess the simplest description is that it is a quite general representation for networks: certainly static networks, but there is also a lot of machinery to manipulate more dynamic networks. The long-term aspiration rests on the idea that, in its simplest form, the brain is a network, albeit a very, very large and complex one, so perhaps we could use some of the BKO structures to approximate a brain. The question is, though, by analogy with human flight: are we trying to build a bird with all its intricate details, or just something that flies using aerodynamic principles and simple non-flapping wings? I think with our BKO scheme we are building a plane.

Now, back to the very basics!
Recall that |some text> is a ket. Sure, but what is less clear is that |some text> can be considered a node in a network. So every ket in some sw file corresponds to a node in the network represented by that sw file. In the brain, kets often map to concepts. For example |hungry>, |emotion: happy>, |url: http://write-up.semantic-db.org/>, |shopping list>, |person: Mary> etc.

And it is known (from epilepsy surgery) that there are specific neurons for, say, "Bill Clinton" and "Halle Berry". So we can assume our brain sw file will include the kets |person: Bill Clinton> and |person: Halle Berry>.

Further, it is well known that there are mirror neurons and place neurons. So, for example, watching someone eat stimulates, to a lesser or greater degree, the same neuron as thinking you want to eat, or actually eating. And in rat brains there are specific neurons that correspond to places. This implies that in a brain sw file we will have corresponding kets such as |eating food> and |cafe near my house>.

But we can also have kets that do not have a lot of meaning by themselves; they only acquire meaning in a superposition with other such kets. The best example I have is web-page superpositions, which are quite close to Jeff Hawkins' SDRs (Sparse Distributed Representations).

But we diverge from SDRs in that superpositions are not restricted to coefficients in {0,1}. Our superpositions can have arbitrary float coefficients, though almost always positive. The interpretation is that the coefficient of a ket is the strength with which that ket is currently activated. I guess in the brain this probably corresponds to how rapidly a neuron is firing within some integration period. An example being: 0.8|emotion: sad> + 12|hungry> corresponds to "a little sad but very hungry".

Next, we can use this to build static networks.
For example these simple learn rules:
op1 |u> => 7.2|a> + 0.02|b> + 13|c> + |d>
op2 |v> => 5|x> + 0.9|u> + 2.712|z>
Can be interpreted to mean:
op1 links the node |u> to |a> with link strength 7.2
op1 links the node |u> to |b> with link strength 0.02
op1 links the node |u> to |c> with link strength 13
op1 links the node |u> to |d> with link strength 1

op2 links the node |v> to |x> with link strength 5
op2 links the node |v> to |u> with link strength 0.9
op2 links the node |v> to |z> with link strength 2.712
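As a rough illustration of that reading, here is a minimal Python sketch that stores learn rules as a labelled, weighted, directed graph. The names (network, learn, apply_op) are mine, not the actual sw back-end:
# network maps (operator, source ket) to a superposition: a dict of {target ket: link strength}
network = {}

def learn(op, ket, superposition):
    # op |ket> => superposition
    network[(op, ket)] = dict(superposition)

def apply_op(op, ket):
    # follow the links labelled op out of |ket>; an unlearned rule gives the empty superposition
    return network.get((op, ket), {})

learn("op1", "u", {"a": 7.2, "b": 0.02, "c": 13, "d": 1})
learn("op2", "v", {"x": 5, "u": 0.9, "z": 2.712})

print(apply_op("op2", "v"))   # {'x': 5, 'u': 0.9, 'z': 2.712}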
And that should be sufficient to construct arbitrary static networks. For dynamic networks we need more work. For example, currently it is impossible to associate a "propagation time" with the links. This contrasts with the brain, where different pathways/links obviously take different times to travel along. Whether this feature is important in building a "plane" I don't yet know. If it is, though, I can imagine a couple of ways to roughly implement it. For example, the proposed swc language, an extension of the current sw file format that adds more familiar programming-language structures, could help with this.

Now, for dynamic networks we need our suite of function operators, stored rules ("#=>"), and to a lesser extent memoizing rules ("!=>"). I suspect we can already represent a lot of dynamic networks with what we already have, and as needs arise we can add a few new function operators here and there. But for a full brain, we almost certainly need the full swc language, or the sw/BKO python back-end.

Quite bizarrely, there seems to be some correspondence between our structure, intended to describe a brain, and the notation used to describe quantum mechanics. I don't know how deep it goes, though assuming it runs too deep would probably be a bit crazy. There is the obvious similarity in notation, with the borrowed kets, bras, operators and superpositions. But also, wave function collapse is very close to our weighted-pick-elt operator, and making a quantum measurement is quite close to asking a question in the console.

For example:
sa: is-alive |Schrodinger's cat> !=> normalize weighted-pick-elt (0.5|yes> + 0.5|no>)

sa: is-alive |Schrodinger's cat>
|yes>
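Ignoring the memoizing side of "!=>", here is a tiny Python sketch of what that measurement step does, on my reading: weighted-pick-elt picks one ket with probability proportional to its coefficient, and normalize then rescales the result so the single surviving ket has coefficient 1:
import random

def measure(sp):
    # pick one ket with probability proportional to its coefficient,
    # then "normalize": collapse to that single ket with coefficient 1
    kets, weights = zip(*sp.items())
    return {random.choices(kets, weights=weights)[0]: 1}

print(measure({"yes": 0.5, "no": 0.5}))   # {'yes': 1} or {'no': 1}, 50/50 each run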
There are of course some obvious differences. The biggest is that QM is over the complex numbers, while BKO is over real-valued floats. And QM has the Schrodinger equation, plus plenty of other maths that is not replicated in the BKO scheme. Though of course, to build a full brain we would need some kind of time evolution operator, and it is unclear what form that would take.

Finally, the idea of a path integral. That a particle in some sense travels through space-time from starting point to end point along all possible pathways is somewhat unobvious to the non-Feynmans of the world. Yet the idea that a spike train travels from starting neuron to end neuron along all possible brain pathways is obvious, and a trivial observation. This correspondence is kind of weird, and I'm not sure how superficial or deep it really is. Given this correspondence, maybe it is appropriate to call a brain "brain-space"? Where concepts map to neurons, and neurons map to a 3D location in a physical brain. And operators map superpositions to superpositions, and propagate signals through brain-space.

Let's finish with one possible mapping from simple BKO learn rules to a neural structure:
where:
- the BKO is:
supported-ops |x> => |op: op1> + |op: op2> + |op: op3>
op1 |x> => |a> + |b> + |c>
op2 |x> => |d> + |e>
op3 |x> => |f> + |g> + |h> + |i>
- large circles correspond to neuron cell bodies
- small circles correspond to synapses
- labeled lines correspond to axons

But this picture was from a long time ago. I now think it is probably vastly too simple!

new function operator: inhibition[t]

The idea is, given a superposition, to try to increase the difference between the topmost ket and the rest. I don't yet know when it will be useful, but it seems it might be, and it was only a couple of lines of code.

How does it work? Well, you feed in a parameter that specifies how much to suppress the smaller terms.
-- define a superposition:
sa: |list> => |a> + 5.2|b> + 23|c> + 13|d> + 17|e> + 17|f> + 15|g>

-- suppress everything except the biggest term:
sa: inhibition[1] "" |list>
0|a> + 0|b> + 23|c> + 0|d> + 0|e> + 0|f> + 0|g>

-- inhibit at half strength:
sa: inhibition[0.5] "" |list>
0.5|a> + 2.6|b> + 23|c> + 6.5|d> + 8.5|e> + 8.5|f> + 7.5|g>

-- negative inhibit (ie, suppress the biggest term):
sa: inhibition[-1] "" |list>
2|a> + 10.4|b> + 23|c> + 26|d> + 34|e> + 34|f> + 30|g>

-- and again:
sa: inhibition[-1]^2 "" |list>
4|a> + 20.8|b> + 46|c> + 52|d> + 34|e> + 34|f> + 60|g>

-- and again:
sa: inhibition[-1]^3 "" |list>
8|a> + 41.6|b> + 92|c> + 104|d> + 68|e> + 68|f> + 60|g>
Hopefully that makes some sense.
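For reference, here is a minimal Python sketch that reproduces those numbers, treating a superposition as a dict of {ket: coeff}. This is just my reading of the behaviour from the examples: the maximum coefficient is left alone and every other coefficient is multiplied by (1 - t):
def inhibition(t, sp):
    # scale every non-maximal coefficient by (1 - t), leaving the maximum unchanged
    if not sp:
        return {}
    mx = max(sp.values())
    return {k: v if v == mx else v * (1 - t) for k, v in sp.items()}

sp = {"a": 1, "b": 5.2, "c": 23, "d": 13, "e": 17, "f": 17, "g": 15}
print(inhibition(1, sp))      # only c survives: {'a': 0, 'b': 0.0, 'c': 23, ...}
print(inhibition(0.5, sp))    # {'a': 0.5, 'b': 2.6, 'c': 23, 'd': 6.5, ...}
print(inhibition(-1, sp))     # {'a': 2, 'b': 10.4, 'c': 23, 'd': 26, ...}

# repeated application, as in inhibition[-1]^3:
out = sp
for _ in range(3):
    out = inhibition(-1, out)
print(out)                    # {'a': 8, 'b': 41.6, 'c': 92, 'd': 104, 'e': 68, 'f': 68, 'g': 60}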

Update: inhibition[-1]^k may be one way to approach the idea of creativity. When thinking about a problem, the obvious, boring answer will have the highest coefficient. But if we suppress the highest few coefficients, then maybe we end up with something more creative. Of course, we also need some way to test that a proposed solution satisfies the required properties. I'm not yet sure how to implement this, but it seems to be something that the brain makes heavy use of: posing questions, and then hunting for answers with the right properties.