Analysis software demo
The software evaluates the textual quality of a Web page: a crawler can use it to decide whether to index a document, and a search engine can use it to sort results in descending order of quality.
Main language of the document :
Density : 0 % Variability : 0 % Quality : 0 %
time analysis : 0 s
Gelatino v1.2.2
Density: the proportion of information in a text
Variability: the range of forms the information takes in the text
Quality: the probability that the text is informative
Quality = ƒ(Density, Variability)
The quality of a text is its ability to inform, to answer a search-engine query, or to be worth indexing.
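The page does not say what ƒ actually is. As a purely illustrative sketch, assuming both scores are expressed in the range 0 to 1, a geometric mean is one function that is high only when density and variability are both high. The name `quality` and the choice of formula are assumptions, not Gelatino's real computation.

```python
import math


def quality(density: float, variability: float) -> float:
    """Hypothetical stand-in for f: geometric mean of two scores in [0, 1].

    This is NOT Gelatino's formula, which the page does not publish.
    """
    if not (0.0 <= density <= 1.0 and 0.0 <= variability <= 1.0):
        raise ValueError("density and variability must be in [0, 1]")
    # The product is zero as soon as one score is zero, so a text needs
    # both density and variability to obtain a high result.
    return math.sqrt(density * variability)


if __name__ == "__main__":
    print(quality(0.8, 0.8))  # both high
    print(quality(0.8, 0.1))  # dense but monotonous
```

With this choice, a text that scores well on only one of the two indicators is still rated low, which matches the statement that quality needs both.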
Troubles
Some Web pages may be analyzed poorly. Here are the main problems the analyzer encounters.
HTML frames
Some pages are laid out with HTML frames. The software cannot follow the links from the frameset page to fetch the content of the frames, so text inside a framed page is not analyzed.
Charset
The software expects the "iso-8859-1" or "utf-8" charset. Conversion from other charsets is not currently done.
Wrong display
Sometimes a page's HTML code does not correspond exactly to what is displayed. The software reads the text as it appears in the HTML code, not the text that appears on the screen. Part of that text may be deliberately hidden.
Unrecognized language
French and English are currently the only two languages with their own analysis plug-in. With very short texts, the language may not be recognized correctly.
No text from URL
Some web pages contain text generated with JavaScript (ads such as Google AdSense) or by other processes that do not use HTML tags. In such cases the text may not be captured.
Overview
Gelatino is software that measures the quality of a text. High density combined with high variability makes a high-quality text.
Working
To evaluate the density and variability of a text, Gelatino looks closely at linguistic criteria such as morphology and syntax. Words change from one specialty to another, but syntactic structures remain stable across any corpus. The more informative the text is, the more numerous and varied its syntactic structures are.
The advantage of Gelatino's syntactic model over a statistical model is that Gelatino cannot be fooled by the repetition of the same words (done to force the page to be indexed for those words). The other advantage is that the evaluation is independent of text length: short, dense texts are rated higher than long, hollow ones.
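The repetition problem can be illustrated with a toy comparison. The sketch below is not Gelatino's syntactic analysis, which is not published here. It assumes a very crude stand-in for structural variety (the share of adjacent word pairs that are distinct) and contrasts it with a raw term count. The function names `term_frequency` and `distinct_pair_ratio` are hypothetical.

```python
import re
from collections import Counter


def tokens(text: str) -> list[str]:
    """Lowercase word tokens; a deliberately naive tokenizer."""
    return re.findall(r"[a-z']+", text.lower())


def term_frequency(text: str, term: str) -> int:
    """Raw count of a term: the kind of score that word stuffing inflates."""
    return Counter(tokens(text))[term]


def distinct_pair_ratio(text: str) -> float:
    """Share of adjacent word pairs that are distinct.

    Assumption: a crude proxy for structural variety, used here only to
    show the effect of repetition. It is NOT Gelatino's syntactic model.
    """
    words = tokens(text)
    pairs = list(zip(words, words[1:]))
    if not pairs:
        return 0.0
    return len(set(pairs)) / len(pairs)


if __name__ == "__main__":
    honest = "Our hotel offers quiet rooms near the old harbour."
    stuffed = "hotel hotel cheap hotel hotel cheap hotel hotel cheap hotel"
    for label, text in (("honest", honest), ("stuffed", stuffed)):
        print(label, term_frequency(text, "hotel"), distinct_pair_ratio(text))
```

The stuffed text wins on the raw count of "hotel" but loses on the variety measure, which is the behaviour the syntactic approach is meant to provide.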
Uses
This indicator of textual quality can be used while a robot crawls the text, so that only informative pages are indexed, which dramatically reduces noise in the database.
After indexing, a search engine can sort results by different criteria such as popularity (though the limits of that model are well known) and by quality criteria. Texts with both the best popularity and the best quality could be displayed first.
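Both uses can be sketched in a few lines. The example assumes each page already carries a quality score and a popularity score between 0 and 1. The `Page` type, the 0.5 threshold, and the product used for ranking are illustrative assumptions, not part of Gelatino.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    url: str
    quality: float     # assumed to be in [0, 1]
    popularity: float  # assumed to be in [0, 1]


def worth_indexing(page: Page, min_quality: float = 0.5) -> bool:
    """Crawl-time filter: keep only pages judged informative enough.

    The threshold value is hypothetical.
    """
    return page.quality >= min_quality


def rank(pages: list[Page]) -> list[Page]:
    """Order results so that pages strong on BOTH criteria come first.

    Multiplying the two scores is one simple way to reward pages that
    are both popular and informative; other combinations are possible.
    """
    return sorted(pages, key=lambda p: p.quality * p.popularity, reverse=True)


if __name__ == "__main__":
    crawled = [
        Page("http://example.org/a", quality=0.9, popularity=0.8),
        Page("http://example.org/b", quality=0.2, popularity=0.9),
        Page("http://example.org/c", quality=0.6, popularity=0.5),
    ]
    index = [p for p in crawled if worth_indexing(p)]
    for page in rank(index):
        print(page.url)
```

Filtering at crawl time keeps hollow pages out of the database entirely, while the ranking step only reorders what was already indexed.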
Demo
The demo version of this software is written in non-optimized PHP and currently runs on ordinary web hosting. The computing power required to analyze texts is very small. The analysis is fast enough to be used online during crawling. Overly long texts are truncated to prevent abuse.
For web pages, all HTML tags are stripped. Only the text between the body tags is analyzed.
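The preprocessing described above can be sketched as follows. The demo itself is written in PHP and its code is not shown here, so this is only an approximation in Python using the standard `html.parser` module. The names `decode_page` and `body_text`, the `MAX_CHARS` limit, and the decision to skip `script` and `style` content are assumptions.

```python
from html.parser import HTMLParser

MAX_CHARS = 20_000  # hypothetical truncation limit; the real value is unknown


def decode_page(raw: bytes) -> str:
    """Decode as utf-8, falling back to iso-8859-1 (the two expected charsets)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")  # every byte maps to a character


class BodyText(HTMLParser):
    """Collect the text found between <body> and </body>, without tags."""

    SKIP = {"script", "style"}  # assumption: code and CSS are not text

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.in_body = False
        self.skip_depth = 0
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.in_body and not self.skip_depth:
            self.parts.append(data)


def body_text(raw: bytes, max_chars: int = MAX_CHARS) -> str:
    """Return the whitespace-normalized body text, truncated to max_chars."""
    parser = BodyText()
    parser.feed(decode_page(raw))
    parser.close()
    text = " ".join(" ".join(parser.parts).split())
    return text[:max_chars]


if __name__ == "__main__":
    page = b"<html><head><title>T</title></head><body><h1>Hello</h1><p>world</p></body></html>"
    print(body_text(page))
```

Because the text is taken from the HTML source, anything injected later by JavaScript or displayed through frames is absent from the result, as noted in the Troubles section.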
Next step
Other languages will be added, including Spanish, Portuguese, German, Danish and Italian.
Other indicators will be developed.
Author
Experience
I have been working on textual analysis since 1990. I obtained a postgraduate diploma in Natural Language Processing from Claude Bernard University of Lyon (France). Then, in 1993, I won an Apple trophy for text-summarization software. I also worked for 4 years as a researcher for a search engine, an Internet start-up. My work on textual indexing was patented.
At present
I'm a webmaster for my own business. I keep my passion for Natural Language Processing. I develop indicators of textual content to help companies in this sector expand their natural language processing products.
As my spoken English is not very fluent, it is better to contact me by email. If you phone me, please speak slowly.
ng@ngweb.fr