Some Xmas Readings

A recommender system based on tag and time information for social tagging systems

Social tagging has recently become increasingly prevalent on the Internet, providing an effective way for users to organize, manage, share, and search for various kinds of resources. Tagging systems offer a great deal of useful information: a tag expresses a user’s preference for a particular resource, while its timestamp captures how the user’s interests drift over time. With the explosion of available information, it becomes necessary to recommend resources a user might like. Since collaborative filtering (CF) aims to provide personalized services, how to integrate tag and time information into CF to deliver better personalized recommendations for social tagging systems becomes a challenging task.

In this paper, we investigate the importance and usefulness of tag and time information when predicting users’ preferences and examine how to exploit such information to build an effective resource-recommendation model. We design a recommender system that realizes our computational approach. We also show empirically, on a real-world dataset, that tag and time information express users’ tastes well, and that better performance can be achieved when such information is integrated into CF.
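Not from the paper itself, but just to make the idea concrete, here is a minimal toy sketch of how tag and time information might be folded into a recommendation score: the user’s tag profile is weighted by an exponential time decay, and candidate resources are ranked by how well their tags match that profile. The function name and the half_life_days parameter are my own illustration, not the authors’ model.

```python
import math
from collections import defaultdict

def tag_time_scores(tagging_events, target_user, now, half_life_days=90.0):
    """tagging_events: iterable of (user, resource, tag, timestamp_in_days)."""
    decay = math.log(2) / half_life_days

    # Build the target user's tag profile, weighting recent tags more heavily.
    profile = defaultdict(float)
    already_tagged = set()
    for user, resource, tag, t in tagging_events:
        if user == target_user:
            profile[tag] += math.exp(-decay * (now - t))
            already_tagged.add(resource)

    # Score every resource the user has not tagged yet by tag overlap.
    scores = defaultdict(float)
    for user, resource, tag, t in tagging_events:
        if user != target_user and resource not in already_tagged:
            scores[resource] += profile.get(tag, 0.0)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Example: recommend resources for user "u1" at day 400.
events = [("u1", "r1", "python", 390), ("u1", "r2", "ml", 100),
          ("u2", "r3", "python", 395), ("u3", "r4", "ml", 380)]
print(tag_time_scores(events, "u1", now=400))
```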

A profile of information systems research published in Expert Systems with Applications from 1995 to 2008

Expert Systems with Applications (ESWA) is regarded as one of the leading journals in the information systems field. This paper profiles research published in ESWA from 1995 to 2008. Based on a multidimensional analysis, we identify the most productive authors and universities, the number of research papers per geographic region, and the topics and methodologies most employed by the most frequently published authors. Our results indicate that (1) ESWA is clearly an international journal, (2) the most employed methodologies are fuzzy expert systems and knowledge-based systems, and (3) the leading authors consistently work across diverse methodologies and applications. Furthermore, implications for researchers, journal editors, universities, and research institutions are presented.

High speed ant colony optimization CMOS chip

Ant colony optimization (ACO) is an optimization technique inspired by the study of ant colonies’ behavior. This paper presents the design and CMOS implementation of an ACO-based algorithm for solving the travelling salesman problem (TSP). To make the algorithm implementable in CMOS, a new variant is introduced: it follows the original ACO but is suited to hardware. Briefly, the pheromone matrix is mapped onto the chip area; ants move up and down through the matrix making their decisions, and finally select a global path. Previous work used only pheromone values, whereas here the selection of the next city is based on both heuristic and pheromone values; the heuristic values are supplied as a matrix in the problem definition. Earlier designs could not be applied to a wide range of optimization problems, but this chip takes the heuristic values as an initial input that can be changed according to the problem, which increases the flexibility of the ACO chip. Simple circuits are used in the chip’s building blocks to increase the speed of convergence, and a linear feedback shift register (LFSR) circuit serves as the random number generator. The chip is capable of solving large TSP instances; it was simulated in HSPICE, and the simulation results show good performance.
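The point about combining pheromone and heuristic values is the standard ACO city-selection rule. As a reminder (a software sketch only, nothing to do with the CMOS circuitry; alpha and beta here are illustrative defaults), the next city is drawn with probability proportional to pheromone**alpha * heuristic**beta:

```python
import random

def choose_next_city(current, unvisited, pheromone, heuristic, alpha=1.0, beta=2.0):
    # Weight each candidate city by pheromone**alpha * heuristic**beta.
    weights = [(pheromone[current][j] ** alpha) * (heuristic[current][j] ** beta)
               for j in unvisited]
    # Roulette-wheel selection proportional to the weights.
    r = random.uniform(0, sum(weights))
    cumulative = 0.0
    for city, w in zip(unvisited, weights):
        cumulative += w
        if r <= cumulative:
            return city
    return unvisited[-1]  # fallback for floating-point rounding
```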

A novel prediction model based on hierarchical characteristic of web site

The Internet has developed rapidly over the last ten years, and the amount of information on web sites has grown just as fast. Predicting web users’ behavior has become a crucial issue, with the goals of increasing users’ browsing speed, reducing their latency as much as possible, and lowering the load on web servers. In this paper, we propose an efficient prediction model, the two-level prediction model (TLPM), which exploits the natural hierarchical structure found in web log data. TLPM reduces the size of the candidate set of web pages and speeds up prediction while maintaining adequate accuracy. Experimental results show that TLPM substantially improves prediction performance as the number of web pages grows.

A Web page classification system based on a genetic algorithm using tagged-terms as features

The enormous growth of information on the World Wide Web has given rise to topic-specific crawling. During a focused crawl, an automatic Web page classification mechanism is needed to determine whether the page being considered is on topic or not. In this study, a genetic algorithm (GA)-based automatic Web page classification system is developed that uses both HTML tags and the terms belonging to each tag as classification features, and learns an optimal classifier from the positive and negative Web pages in the training dataset. The system classifies Web pages by simply computing the similarity between the learned classifier and a new page. Existing GA-based classifiers use only HTML tags or only terms as features; here both are taken together and optimal feature weights are learned by the GA. It was found that using both HTML tags and the terms in each tag as separate features improves classification accuracy, and that the composition of the training dataset matters: when the number of negative documents exceeds the number of positive ones, the system’s classification accuracy rises to 95% and surpasses the well-known Naïve Bayes and k-nearest-neighbor classifiers.
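The feature idea is worth pausing on: a term is not a feature by itself, the (HTML tag, term) pair is. Here is a toy sketch of that representation and of the weighted-similarity scoring, the weights being what the GA would evolve; the function names and the hand-picked weights below are my own illustration, not the authors’ code.

```python
from collections import Counter

def tagged_term_features(parsed_page):
    """parsed_page: iterable of (html_tag, text) blocks; features are (tag, term) pairs."""
    feats = Counter()
    for tag, text in parsed_page:
        for term in text.lower().split():
            feats[(tag, term)] += 1
    return feats

def similarity(weights, feats):
    """Weighted overlap between a learned weight vector and a page's features."""
    return sum(weights.get(f, 0.0) * count for f, count in feats.items())

# Example: "python" under <title> counts more than under <p> if the learned
# (here hand-picked) weights say so.
weights = {("title", "python"): 2.0, ("p", "python"): 0.5}
page = [("title", "Learning Python"), ("p", "a tutorial about python basics")]
print(similarity(weights, tagged_term_features(page)))
```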

Using chi-square statistics to measure similarities for text categorization

In this paper, we propose using chi-square statistics to measure similarities and chi-square tests to determine the homogeneity of two random samples of term vectors for text categorization. We first study the properties of chi-square tests for text categorization. One advantage of the chi-square test is that its significance level corresponds to the miss rate, which provides a foundation for a theoretical performance (miss rate) guarantee. A classifier using cosine similarities with TF*IDF generally performs reasonably well in text categorization, but its performance may fluctuate even near the optimal threshold value. To overcome this limitation, we propose using chi-square statistics and cosine similarities in combination. Extensive experimental results verify the properties of the chi-square tests and the performance of the combined approach.
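As a back-of-the-envelope illustration of the statistic (my own simplification, not the paper’s exact procedure): treat two term-count vectors as the rows of a 2 x K contingency table and compute the chi-square statistic of homogeneity; the smaller it is, the more the test document looks like it was drawn from the same distribution as the category.

```python
def chi_square_statistic(counts_a, counts_b):
    """Chi-square homogeneity statistic for two term-count vectors
    (rows of a 2 x K contingency table); df = (non-empty columns) - 1."""
    total_a, total_b = sum(counts_a), sum(counts_b)
    grand = total_a + total_b
    stat = 0.0
    for a, b in zip(counts_a, counts_b):
        col = a + b
        if col == 0:
            continue  # skip terms absent from both samples
        expected_a = total_a * col / grand
        expected_b = total_b * col / grand
        if expected_a > 0:
            stat += (a - expected_a) ** 2 / expected_a
        if expected_b > 0:
            stat += (b - expected_b) ** 2 / expected_b
    return stat

# Example: term counts for a test document vs. an aggregated category profile.
doc      = [3, 0, 1, 5]
category = [30, 2, 8, 40]
print(chi_square_statistic(doc, category))
```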

Simulated annealing with adaptive neighborhood: A case study in off-line robot path planning

Simulated annealing (SA) is an optimization technique that can handle cost functions with arbitrary degrees of nonlinearity, discontinuity, and stochasticity, as well as arbitrary boundary conditions and constraints imposed on those cost functions. Here the SA technique is applied to the problem of robot path planning. Three situations are considered: the path represented as a polyline, as a Bézier curve, and as a spline-interpolated curve. In the proposed SA algorithm, the sensitivity of each continuous parameter is evaluated at each iteration, increasing the number of accepted solutions; each parameter’s sensitivity is tied to its probability distribution when generating the next candidate.
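Loosely in the spirit of that last point, here is a toy sketch of my own (not the paper’s algorithm): simulated annealing over a vector of continuous parameters where each parameter keeps its own neighborhood size, widened when its moves are accepted often and narrowed when they are mostly rejected. The schedule constants are arbitrary.

```python
import math
import random

def anneal(cost, x, steps=5000, t0=1.0, cooling=0.999):
    temperature = t0
    step = [1.0] * len(x)           # per-parameter neighborhood size
    accepted = [0] * len(x)
    best, best_cost = list(x), cost(x)
    current_cost = best_cost
    for i in range(steps):
        k = i % len(x)              # perturb one parameter at a time
        candidate = list(x)
        candidate[k] += random.uniform(-step[k], step[k])
        c = cost(candidate)
        if c < current_cost or random.random() < math.exp((current_cost - c) / temperature):
            x, current_cost = candidate, c
            accepted[k] += 1
            if c < best_cost:
                best, best_cost = list(candidate), c
        # Adapt each parameter's neighborhood every 20 trials of that parameter.
        if (i + 1) % (20 * len(x)) == 0:
            for j in range(len(x)):
                rate = accepted[j] / 20.0
                step[j] *= 1.5 if rate > 0.6 else (0.5 if rate < 0.4 else 1.0)
                accepted[j] = 0
        temperature *= cooling
    return best, best_cost

# Example: minimize a simple 2-D bowl.
print(anneal(lambda p: (p[0] - 3) ** 2 + (p[1] + 1) ** 2, [0.0, 0.0]))
```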

Identifying new business areas using patent information: A DEA and text mining approach

From a resource-based point of view, a firm’s technological capabilities can serve as underlying sources for identifying new businesses. However, current methods are insufficient to systematically and clearly support firms in finding new business areas based on their technological strengths. This research proposes a systematic approach to identifying new business areas grounded in a firm’s relative technological strength. Patent information is useful as a measure of a firm’s technological resources, and data envelopment analysis (DEA) is beneficial for weighting patents according to their quality. With this weighted measure of patent quality, a firm can evaluate its relative technological strength at the industry and product level across potential business areas. To compute technological strength by product, this research applies text mining, a method for discovering knowledge in unstructured data, to patent documents. The paper shows the usefulness of the proposed framework with a case study.

Communities and dynamical processes in a complex software network

Complex technological networks represent a growing challenge to support and maintain as their number of elements grows and their interdependencies become more involved. On the other hand, for networks that grow in a decentralized manner, it is possible to observe certain patterns in their overall structure that can be exploited for a more tractable analysis. An example of such a pattern is the spontaneous formation of communities or modules. An important question regarding the detection of communities is whether they are really representative of any internal network feature. In this work, we explore the community structure of a real complex software network and correlate this modularity information with the internal dynamical processes the network is designed to support. Our results show a remarkable dependence between community structure and internal dynamical processes, supporting the view that a community division of this complex network helps in assessing the underlying dynamical structure and is thus a useful tool for obtaining a simpler representation of the network’s complexity.

Are we alone in the Universe?

NASA has scheduled a conference on an “Astrobiology Discovery” for late tomorrow. A report in the journal Science is also expected, under embargo until 2 p.m. EST (the time of the conference, 19h in Lisbon).

What is this all about? Extra-terrestrials? Aliens? What X-Files are these? Will we find out that we are not native to Earth? Kobol? Hm… I’m mixing serious stuff with fun here, but I’m really expecting this announcement to be more than just another black hole in our neighborhood.

The Bandits’ Manual

Reasonable election systems cannot make manipulation impossible. However, they can make manipulation computationally infeasible.

It’s election time here in Portugal, and interestingly the Communications of the ACM has an article on how to protect elections through (computational) complexity. The article is a fun read, going through several voting systems and several approaches to manipulating and protecting elections.

It is interesting to see so clearly the ways election systems are flawed and where the attack vectors lie for someone thinking about controlling an election process. A voter who believes he is taking part in a democratic process should think again after reading this article. Election systems are not inherently democratic; they are constructions made by people, shaped to best serve certain interests at a particular point in time. As such they are exposed to formidable manipulation schemes, and our only hope is to make the system so difficult to manipulate that it deters those forces from trying.

Boilerplate: Article extraction from webpages

The amount of clutter text present on webpages makes the task of discovering what is important a pain. At the observatorium I’ve been using a simple tag-to-text ratio to try to extract the important sections of text from webpages (roughly the idea sketched below). The results are good but not great; the method is fast, and it works if one accepts that noise exists and can’t be totally eliminated.
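For reference, a very stripped-down sketch of the idea (the thresholds and constants here are arbitrary illustrations, not what actually runs on the observatorium): compute, per line of HTML, how many characters of plain text there are per tag, and keep the lines that sit clearly above the page average.

```python
import re

TAG = re.compile(r"<[^>]+>")

def text_to_tag_ratios(html):
    """Per line: characters of plain text divided by the number of tags."""
    ratios = []
    for line in html.splitlines():
        n_tags = len(TAG.findall(line))
        n_text = len(TAG.sub("", line).strip())
        ratios.append(n_text / max(n_tags, 1))
    return ratios

def extract_content(html, factor=2.0):
    """Keep the lines whose ratio is well above the page average."""
    lines = html.splitlines()
    ratios = text_to_tag_ratios(html)
    threshold = factor * (sum(ratios) / max(len(ratios), 1))
    return "\n".join(TAG.sub("", line).strip()
                     for line, r in zip(lines, ratios) if r > threshold)
```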

The other day I found another technique that I think might become my de facto standard for text extraction from webpages, as its first results are better than I expected. The algorithm detects the meaningful sections of pages with high accuracy and has the added benefit of being truly fast.

This is derived from the paper “Boilerplate Detection using Shallow Text Features” by Christian Kohlschütter et al., presented at WSDM 2010, and there’s a Google Code repository available with the Java source and binaries to download.
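To give a feel for what “shallow text features” means, here is a crude approximation of my own (not the published classifier or the code in the repository): split the page into blocks and keep the ones whose word count is high and whose link density, the fraction of words inside anchors, is low. The thresholds below are arbitrary.

```python
import re

TAG = re.compile(r"<[^>]+>")
ANCHOR = re.compile(r"<a[^>]*>(.*?)</a>", re.S | re.I)

def block_features(block_html):
    """Return (word count, link density) for one HTML block."""
    words = TAG.sub(" ", block_html).split()
    linked_words = TAG.sub(" ", " ".join(ANCHOR.findall(block_html))).split()
    link_density = len(linked_words) / max(len(words), 1)
    return len(words), link_density

def is_content(block_html, min_words=15, max_link_density=0.33):
    """Heuristic: long blocks with few links tend to be article text."""
    n_words, link_density = block_features(block_html)
    return n_words >= min_words and link_density <= max_link_density
```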

On Google self-driving cars

I couldn’t agree more with what Alan Winfield says about this problem: the challenge is not to have an autonomous car, the challenge is to prove the correctness of the system, and of a system that learns as it goes. This type of adaptive system poses several challenges, and probably the hardest one is the public perception of its quality and robustness. We’ve seen some problems recently: the cases of the electronic accelerator of the Prius and the emergency-braking system of the Volvo C60 both show that these systems are in their early deployment days. To go from this stage to a future driving experience where our cars are our personal chauffeurs there’s still a long road ahead (pun intended), and we probably won’t see these kinds of cars in the next 10 years or so.

Links:

Google Announcement
Alan Winfield blog entry

Philosophy and Neurobiology challenge AI

António Damásio - O Livro da Consciência

After the presentation I attended today at FC, by professor Hélder Coelho, on António Damásio’s latest work, O Livro da Consciência has just gone onto my shopping list (to be dealt with over the weekend).

And now a link about part of the work behind this book, published in 2009 on NPR (American public radio).

As for the talk, it was a pity there wasn’t time for the second part, where the connection to artificial intelligence issues would have been explored in more depth, but there will surely be other opportunities.