March 02, 2004
Machine learning for auto-categorization

Back from Austin (Fifth Annual ASIS&T Information Architecture Summit 2004), a very impressive meeting.
One talk, [Using Machine Learning Techniques to Populate Dynamic Interfaces], was about information clustering.
This is exactly what I am working on now. However, I am not using a clustering method; I am using an NN (neural network) to automatically discover the information in documents.
After the meeting I talked with the speaker, Miles Efron, a postdoc at the University of North Carolina, and he was impressed with what I was doing. What concerns me is that the systematic error of the NN is too large.
Here is what I have done in the last few months:

1] What is the information of one document?
Category, keywords, description, metadata.
The category is very difficult to assign if the document itself was not assigned to one when it was published. So my problem is: to help achieve accurate search results, I need to assign every document to a category.
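As a rough illustration of the four kinds of information listed above, here is a minimal Python sketch of a document record. The class name `Document`, its field types, and the `needs_category` helper are my own assumptions for illustration, not the actual data model behind the search engine.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Document:
    """Hypothetical record for the information of one document."""
    keywords: List[str] = field(default_factory=list)
    description: str = ""
    metadata: Dict[str, str] = field(default_factory=dict)
    # None means the document was published without a category.
    category: Optional[str] = None

    def needs_category(self) -> bool:
        """True when a category still has to be assigned automatically."""
        return self.category is None


# Toy records, invented for this example only.
docs = [
    Document(keywords=["loan", "bank"], description="Bank loan rates", category="money"),
    Document(keywords=["neural", "network"], description="Notes on an MLP"),
]
uncategorized = [d for d in docs if d.needs_category()]
```

Only the documents in `uncategorized` would need to go through the NN.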
2] How to use a neural network to auto-categorize?
(1) I have around 5000 documents whose categories are already known, divided into 15 categories:
Novel(ID=1), money(ID=2), study(ID=3), social(ID=4).....
Feed all {keywords, descriptions, metadata, contents} into the NN and tell the NN which of the 15 categories each document belongs to.
My NN is an MLP (Multi-Layer Perceptron) with 2 layers and 72 neurons.
After training 600 times with optimal steps, my NN is pretty stable, with outputs of 1 (Novel), 2 (money), ..... This means my NN can recognize these documents and already knows which categories they belong to.
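To make the training step more concrete, here is a toy sketch in plain Python. It is not my actual network: the binary bag-of-words features, the softmax output with one unit per category, the tanh hidden layer, the learning rate, and the tiny 4-document, 2-category data set are all assumptions chosen to keep the example short. Category indices here start at 0, unlike the IDs above.

```python
import math
import random


def bag_of_words(text, vocab):
    """Binary feature vector: 1.0 if the vocabulary word occurs in the text."""
    words = set(text.lower().split())
    return [1.0 if w in words else 0.0 for w in vocab]


def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


class TinyMLP:
    """Hypothetical toy MLP: one tanh hidden layer, softmax output."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_out)]
        self.b2 = [0.0] * n_out

    def forward(self, x):
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        p = softmax([sum(w * hi for w, hi in zip(row, h)) + b
                     for row, b in zip(self.w2, self.b2)])
        return h, p

    def predict(self, x):
        _, p = self.forward(x)
        return max(range(len(p)), key=lambda k: p[k])

    def train_step(self, x, label, lr):
        """One gradient-descent step on the cross-entropy loss of one document."""
        h, p = self.forward(x)
        # Gradient w.r.t. the output pre-activations: p - one_hot(label).
        d_out = [pk - (1.0 if k == label else 0.0) for k, pk in enumerate(p)]
        # Backpropagate to the hidden layer before w2 is modified.
        d_hid = [(1.0 - h[j] ** 2) * sum(d_out[k] * self.w2[k][j] for k in range(len(d_out)))
                 for j in range(len(h))]
        for k, dk in enumerate(d_out):
            for j in range(len(h)):
                self.w2[k][j] -= lr * dk * h[j]
            self.b2[k] -= lr * dk
        for j, dj in enumerate(d_hid):
            for i in range(len(x)):
                self.w1[j][i] -= lr * dj * x[i]
            self.b1[j] -= lr * dj
        return -math.log(p[label])


# Toy labelled documents, invented for this example (0 = Novel, 1 = money).
vocab = ["novel", "story", "bank", "loan"]
docs = [("a short novel", 0), ("a story in a novel", 0),
        ("bank loan rates", 1), ("a loan from the bank", 1)]
data = [(bag_of_words(text, vocab), label) for text, label in docs]

net = TinyMLP(n_in=len(vocab), n_hidden=4, n_out=2)
losses = []
for epoch in range(200):
    losses.append(sum(net.train_step(x, y, lr=0.5) for x, y in data) / len(data))
predictions = [net.predict(x) for x, _ in data]
```

Watching `losses` fall and level off is one simple way to judge whether the output has become stable, in the sense described above.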
3] Verification.
The NN may be overtrained, which can cause a very large systematic error.
I tested it with 4500 new documents (with already known categories) which my NN had never seen. The result is consistent with the training result in [2].
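The check above can be sketched as a comparison between accuracy on the training documents and accuracy on documents the NN has never seen. The function names and the label lists below are made up for illustration; they are not my real results.

```python
def accuracy(predicted, actual):
    """Fraction of documents whose predicted category ID matches the known one."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("need at least one document")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def generalization_gap(train_acc, heldout_acc):
    """A large positive gap suggests the network is overtrained."""
    return train_acc - heldout_acc


# Invented category IDs, only to show how the functions are used.
train_pred, train_true = [1, 2, 3, 4, 2], [1, 2, 3, 4, 2]
held_pred, held_true = [1, 2, 3, 3], [1, 2, 3, 4]

train_acc = accuracy(train_pred, train_true)
heldout_acc = accuracy(held_pred, held_true)
gap = generalization_gap(train_acc, heldout_acc)
```

If the held-out accuracy stays close to the training accuracy, the gap is small and overtraining is less of a worry.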
4] Testing
Now I use Google to download 3000 documents and pass them through my NN. For each document, my NN gives the probability that it belongs to each of the 15 categories.
Example: for this document, the result is:
Technology: 79% Error: 10%
Biology: 15% Error: 4%
.....
.....
Social: 0% Error: 0%
Sports: 0% Error: 0%
The overall systematic error is 14%.
So this post will automatically be assigned to the "Technology" category by my NN machine.
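The last step, picking a category from the probabilities, can be sketched as follows. The function name `assign_category` and the `min_confidence` cutoff are assumptions of mine for illustration; the example only uses the four rows shown above and leaves out the elided ones.

```python
from typing import Dict, Optional


def assign_category(probabilities: Dict[str, float],
                    min_confidence: float = 0.5) -> Optional[str]:
    """Return the most probable category, or None if it is below the cutoff.

    min_confidence is a hypothetical threshold, not part of the original setup.
    """
    if not probabilities:
        return None
    best = max(probabilities, key=probabilities.get)
    return best if probabilities[best] >= min_confidence else None


# Only the rows listed in the example; the elided rows are not guessed.
example = {"Technology": 0.79, "Biology": 0.15, "Social": 0.0, "Sports": 0.0}
assigned = assign_category(example)
```

Returning None for a low top probability is one possible way to leave uncertain documents for manual review instead of forcing them into a category.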
BTW, an NN can also generate the keywords and description, which is the real IA (Information Architecture) work I will focus on when I launch my search engine.
Posted at March 2, 2004 11:33 AM by Liang
