
The use of Latent Semantic Optimization (LSO) to
optimize Search Engine Ranking Positions.

Ir. David De Bock

October 2007

This paper is written as a basis for the thesis “Ranking Optimization of a Web Page by Means of Latent Models”.

Master’s degree in Artificial Intelligence, Department of Computer Science.

Prof. Marie-Francine Moens

University of Leuven
KUL, 3000 Leuven, Belgium

Permission is granted to quote short excerpts and to reproduce figures and tables from this paper, provided that the source of such material is fully acknowledged.

    

1. Abstract

 
The purpose of this paper is to introduce the theory of Latent Semantic Optimization (LSO) as a tool for Search Engine Optimization (SEO).
 
Current SEO tools can be summarized into three major groups of commonly agreed upon guidelines and recommendations:
 
     - First group: Guidelines related to the code (HTML, XML, PHP, etc.)
     - Second group: Guidelines related to linking, both to and from the website
     - Third group: Guidelines regarding the content of the website
 

There is widespread agreement amongst today’s SEO experts over the do’s and don’ts regarding the first two groups. These are generally based on a set of practical rules and guidelines which tell you more or less how to build your website in a way that is attractive to the search engines. This is nevertheless far from an exact science and still largely based on trial and error, so contradictions between SEO recommendations are naturally easy to find. This paper will not tell you who is right or wrong, nor will it go deeper into any of the guidelines from the first two groups, which are beyond its scope.

The third group, “Guidelines regarding the content of the website”, is far less commonly described in SEO guidelines and recommendations. The only generally agreed guideline is “write good content”. However, you will not find a satisfactory definition within the SEO world of what exactly constitutes “good content”. In short, current SEO knowledge regarding what major search engines consider “good content” and “less good content” is far from clear.


However, determining which website is “on topic” and which is “a little bit less on topic” is of course of critical importance for a search engine, and will largely determine its success on the internet. This explains the success of Google, for example. Google will rank highest the websites whose content is most closely on topic for the search term you are looking for. Such website ranking is not entirely based on the content being on topic (other factors such as back-linking will also determine the ranking), but the content makes up a large part of the ranking algorithm. We, and many other SEO experts, estimate that content is 50% responsible for the final ranking.


In order to determine whether the content of a webpage is on topic for a certain search term, the major search engines (Google, Yahoo, etc.) use modifications of the Latent Semantic Analysis and Latent Semantic Indexing (LSA and LSI) models. These define how the ideal content for a certain search term should look, and rank the webpage according to how closely it resembles this ideal content or ContentDNA.


This paper will prove that by using a model called the Latent Semantic Indexing Model of the Second Generation it is theoretically possible to approximate the ideal content or ContentDNA for a certain search term very closely. This technique is called Latent Semantic Optimization (LSO).


Latent Semantic Optimization is a technique to optimize Search Engine Ranking by using Latent Semantic Indexing and Latent Semantic Analysis to approximate the ideal content or ContentDNA. This paper will give a summary of how LSA and LSI work and how they can be implemented in a theoretical tool to optimize your website ranking.


Keywords: Latent Semantic Optimization, LSO, ContentDNA, Latent Semantic Analysis, LSA, Latent Semantic Indexing, LSI, Search Engine Optimization, SEO.
 
    

2. Acknowledgement

 
I would like to thank my wife and children for their continuous support during the long evenings when I was occupied writing this paper. I would also like to particularly recognize and thank my parents for their constant support of my endeavors, whatever they have been.
 
    

3. Introduction to Latent Semantic Optimization

 

This paper starts out by explaining how a group of people speaking the same language and living in the same geographical area develop over time a Common Neurolinguistic Map for every topic they talk and write about.

We will transfer this concept to the internet, arguing that ContentDNA is the closest match on the internet for the Common Neurolinguistic Map. Furthermore, we will argue that search engines rank webpages based largely on the extent to which they correspond to the ideal content or ContentDNA.

We will then outline in broad strokes how today's search engines are using Latent Semantic Analysis and Latent Semantic Indexing to determine the exact content or ContentDNA.

Finally we will show that by using the same techniques – LSA and LSI – it is theoretically possible to approach very closely, for any search term, the ideal content or ContentDNA.

When you know what the ContentDNA for a certain search term is, you can use it within the content of your webpage to optimize your search engine ranking, in other words, as an SEO tool.

    

4. The Common Neurolinguistic Map

    


4.1. The human brain creates a Neurolinguistic Map for every topic.

 

From the time that we utter our first words we start to create mental connections between words and word combinations. Our brain connects words because we use them ourselves in a certain combination, or because we read or hear them in a context that reflects our environment, our education, our moral values, etc.

At a young age our brain is at its most active in creating those connections between words; the older we become the more difficult it is to make new ones, and the more we rely on the old ones we have created at a younger age.

When we have a conversation, or read or write about a certain topic, our brain is continuously building relational bridges between words and word combinations around the topic concerned.

Example: when we have a discussion on the topic “Bill Gates”, we will automatically start speaking about “Microsoft”, “Vista” and “Windows”. Because those words are so often used together in so many media (internet, newspapers, etc.) our brain has built strong neural relationships between them.

    


4.2. Cultural and geographical differences reflect in a different Neurolinguistic Map.

 

The Neurolinguistic Map varies from person to person, corresponding to cultural, geographical, linguistic, religious and other differences.

Even if people speak the same language (North Americans and Australians, for example) or live in the same country (like Texans and New Yorkers), they will have a different Neurolinguistic Map for most topics.

Example:
Topic : “Football”
When people in the UK think of the topic “Football” they will associate this with “Manchester United”, “World Cup”, “Premier League”, etc.

When people in North America think of the topic “Football” they will associate this with “New York Giants”, “National Football League”, etc.

 

    

4.3. Although we all have a different Neurolinguistic Map as individuals we can share a common part as a group: The Common Neurolinguistic Map.

 

When people who speak the same language are associated with one another (for example by living in the same country or state, by having the same religion or by sharing the same historical background), they tend to build similar neural connections between words and word combinations on given topics.

They are all still individuals, each with their own individual Neurolinguistic Map. But the individual Neurolinguistic Maps overlap for certain topics.

We call this overlap the Common Neurolinguistic Map.

    Let's explain this with an example:

     Topic: “Red wine”

Two residents of Paris, both born and raised there, share, in addition to French nationality, an adoration for French red wine! Gaston knows a lot about the Bordeaux region and could tell you endless stories about its wines and cellars. The second, René, knows more about the red wines of Bourgogne and could give you interesting accounts of the numerous bottles of Bourgogne he has lying in his own wine cellar.

We asked each of them to write down the first 10 words that came into their minds when we said “Red wine”:

This is the list of Gaston

1    Bordeaux
2    Pomerol
3    Saint-Émilion
4    Fronsac
5    Pinot noir
6    Grand crus de Bordeaux
7    Chardonnay
8    Good wine
9    France
10  Red meat

 

This is the list of René

1    Bourgogne
2    Beaujolais
3    Dijon
4    Grand crus de Bourgogne
5    Côtes du Rhône
6    Cheese
7    France
8    Meat
9    Pinot noir
10  Chardonnay

 

Both men begin their lists referring to the wines of the area they each know the best. But thereafter we see some commonalities:

   •  Both of them name the same grapes, since both wine areas use these particular grapes:

 

  Gaston              René

  5 Pinot noir        9 Pinot noir
  7 Chardonnay        10 Chardonnay

 

   •  Both of them refer to their best wines in more or less the same way: “Grand crus de Bordeaux” and “Grand crus de Bourgogne”. Although the two terms are different there is a common part, “Grand crus de”:

  6 Grand Crus de     4 Grand Crus de

 

   •  Gaston is convinced that Bordeaux goes best with “red meat” while René considers “meat” in general an appropriate accompaniment for Bourgogne. We can conclude that they would both agree that “Red wine” goes with “meat”. So the common part is “meat”:


  10 meat             8 meat

 

   •  Finally, both of them are chauvinistic to the bone, and consider France to be the best red wine producing country in the world…

  9 France            7 France

 

 

To reveal the Common Neurolinguistic Map of the word combination “Red Wine” we use the following two steps:

Step 1: We find out what the common words and word combinations are and calculate their average score.

Average score    Common part

(5+9)/2= 7           Pinot noir
(7+10)/2=8.5       Chardonnay
(6+4)/2=5            Grand Crus de
(10+8)/2=9          meat
(9+7)/2=8            France

Step 2: We rank the words and word combinations according to the average score, resulting in the Common Neurolinguistic Map .

1 Grand Crus de
2 Pinot noir
3 France
4 Chardonnay
5 meat
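
The two steps above can be sketched in a few lines of Python. This is only an illustration of the averaging and ranking described in this example: the ranks are copied from the two lists, and the reduction of “Grand crus de Bordeaux” and “Red meat” to their common parts is done by hand, as in the text. The names `common_ranks` and `common_map` are illustrative and do not come from any existing tool.

```python
# Ranks of the common words and word combinations, copied from the example
# as (rank in Gaston's list, rank in René's list).
common_ranks = {
    "Pinot noir": (5, 9),
    "Chardonnay": (7, 10),
    "Grand Crus de": (6, 4),
    "meat": (10, 8),
    "France": (9, 7),
}


def common_map(ranks):
    """Step 1: average the ranks. Step 2: order by average, lowest first."""
    averages = {term: sum(pair) / len(pair) for term, pair in ranks.items()}
    return sorted(averages.items(), key=lambda item: item[1])


for position, (term, average) in enumerate(common_map(common_ranks), start=1):
    print(position, term, average)
```

A lower average means the term came to mind earlier for both men, so it ranks higher in the Common Neurolinguistic Map.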

Conclusion

Although Gaston and René each have a different Neurolinguistic Map built around the topic “Red wine”, there is a common part that binds them. The common part is the knowledge both of them have based on their mutual association with the topic “Red wine”.

This common knowledge is determined by:

Geography: They both live in France.
Language: They are both native French speakers.
Culture: Both were born and raised in Paris, with a strong chauvinism.

This common content around the topic “Red wine” – that is, the Common Neurolinguistic Map for the topic “Red wine” – is created and formed through various media and experiences: education, social events, magazines, newsletters, advertisement, books, verbal communication, etc.

This common knowledge around the topic “Red wine” has historical roots not only in the minds of Gaston and René, of course, but in the minds of all Frenchmen living in France.

Were we to ask the same question of all Frenchmen living in France and make an average of the common part, we would get a close reflection of the actual Common Neurolinguistic Map of Frenchmen living in France for the topic “Red wine”.

 

    4.4. The growth of a Human Semantic Environment
 

A Human Semantic Environment defines an environment of a group of people who share a large Common Neurolinguistic Map for many different topics.

The growth of a Human Semantic Environment is mostly generated by:

    •  The country or state that people live in:

People living in the same geographical area naturally form a Common Neurolinguistic Map on many topics.

    •  The language people speak :

Speaking the same language is a condicio sine qua non for growing a Common Neurolinguistic Map.

    •  The moment in time that people live in :

Although the first pioneers living in North America spoke English, their Neurolinguistic Maps of certain topics have evolved dramatically over the ensuing years.

Therefore today's North Americans do NOT have the same Human Semantic Environment as their forebears, although they live in the same country.

On the other hand our Common Neurolinguistic Map still has a lot of similarities with the Common Neurolinguistic Map of our parents and grandparents. This indicates that widespread differences occur only over many generations.

Conclusion:

A Human Semantic Environment defines an environment of a group of people who share a large Common Neurolinguistic Map for many different topics.

This is largely generated by people who live in the same country or state, share a common language and who are living (or have lived) in a defined period of time.

Four examples of a Human Semantic Environment:


    •  English speaking Americans living in California for the past three generations.
    •  French speaking Canadians living in Canada for the past five generations.
    •  German speaking East Germans living in East Germany for the two generations before the collapse of the wall.
    •  English speaking Americans living in New York since 9/11.
 
 

    4.5. For a defined Human Semantic Environment there will be, over time, a slow evolution of the Common Neurolinguistic Map.
 

Because of the rapid evolution in science, industrial development, IT, medical research, etc., mankind is continuously developing new Human Cognitive Relations around new and, more importantly, existing topics.

Moreover, the evolution of our living environment – due to global warming, for example – results in new Human Verbal Concepts around existing topics.

As a result the Common Neurolinguistic Map for a given topic in a given Human Semantic Environment will change over time.

Example:

Human Semantic Environment 1 :

English speaking Englishmen living in the UK in the '80s

Human Semantic Environment 2 :

English speaking Englishmen living in the UK today

Topic:

“computer”

Had you asked 100,000 people from Human Semantic Environment 1 to write down their first 100 thoughts on the topic “computer”, then words like “floppy disk”, “Commodore” and “IBM” would have been high in the list.

Were you to ask 100,000 people from Human Semantic Environment 2 to write down their first 100 thoughts on the topic “computer”, then words like “CD-Rom”, “Apple” and “Microsoft” would be very high in the list, “IBM” would be much lower down, and only at the very bottom might you still find “Commodore”.

Conclusion:

For a defined Human Semantic Environment there will be, over time, a slow evolution of the Common Neurolinguistic Map .

For a given topic, words and word combinations that are becoming less popular or are used less frequently in our daily conversations will drop down the list, and words that are becoming more popular because they are used more often in different media will rise up the list.

So the Common Neurolinguistic Map for a given topic evolves naturally over time.

 

    4.6. Overnight changes of the Common Neurolinguistic Map for a certain topic.
 

The Common Neurolinguistic Map might also change abruptly at a certain point in time. An extraordinary event affecting a total population, or a certain group within the population, will change the Common Neurolinguistic Map of certain related topics.

This means that overnight the Human Semantic Memory around a certain topic can change dramatically as new neural bridges are created. There are two distinctive trends we can outline from this perspective:

First trend:

Events with a negative effect on a Human Semantic Environment have a stronger impact on the Common Neurolinguistic Map than positive events. Human Cognitive Processes memorize bad events more easily than good events.

Second trend:

The Human Semantic Memory can be created overnight, but cannot be destroyed overnight. In other words it's relatively easy to create new Neural Bridges around a certain topic, but relatively difficult to destroy existing ones.

As a group we can hold certain phenomena for a given topic in our memory for a very long time; on the other hand, we can change our opinion regarding the same topic overnight, remembering how it was in the past compared to how it is now.

Example
Human Semantic Environment 1:

English speaking Americans living in North America before 9/11

Human Semantic Environment 2:

English speaking Americans living in North America after 9/11

Topic:

“World Trade Center”

Common Neurolinguistic Map based on Environment 1

Before 9/11, North Americans, like most people, associated the World Trade Center with the core business centre of New York, or even of the world. The WTC was associated with prosperity, capitalism, the mighty dollar, New York, North America, etc.

Common Neurolinguistic Map based on Environment 2

After 9/11, the Common Neurolinguistic Map of North Americans changed overnight. People now thought of the World Trade Center in connection with terrorism, Bin Laden, insecurity, attack, war, etc. This was eventually reflected in all media, from newspapers and magazines to the internet.

    

5. The best reflection of the Common Neurolinguistic Map can be found on the internet.

 

    5.1. The internet is at present the most accessible and the fastest medium for providing a reflection of the Common Neurolinguistic Map.
 

The internet is now, by far, the fastest medium for following the evolution of mankind. It is also currently the most accessible medium for retrieving instant knowledge. Speed, however, has the disadvantage that not all content can be instantly verified to be 100% correct. So although the internet is the fastest and most accessible medium, it is certainly not the most accurate one.

Printed media, certainly on scientific topics, retains the advantage of a higher degree of certainty regarding accuracy. However, compared to the internet it is slow and, lacking a central database, far less accessible. Therefore it is more difficult (although not impossible) to retrieve a reflection of the Common Neurolinguistic Map from printed media.

Printed media is most useful as a reflection of the Common Neurolinguistic Map for topics that are historically relatively stable.

Example

Human Semantic Environment:

English speaking Americans living in North America today

Topic:

“American civil war”

Over the decades most topics regarding the American Civil War (1861–1865) have been discussed, written about and rewritten numerous times in a multitude of books, newspapers, magazines and other printed media. Because there have been no dramatic new events regarding this topic, the content that can be found in printed media is a good reflection of the Common Neurolinguistic Map.

On the other hand, the internet is a far more useful medium for garnering a reflection of the Common Neurolinguistic Map for any topic that continues to evolve over time and where an instant up-to-date insight is required.

Thanks to its endless amount of data, in combination with near instant updating, the internet harbours – somewhere in its unbounded content – a copy of the Common Neurolinguistic Map for every topic.

 

    5.2. Intelligent search engines such as Google have realised the importance of finding a copy of the Common Neurolinguistic Map for every topic on the internet.
 

Intelligent search engines such as Google have realized the importance of knowing exactly what the Common Neurolinguistic Map is for a given topic. When a search engine knows this exactly, it can compare the content of every website with the content of the Common Neurolinguistic Map.

Search engines that master the technology to rank websites according to the closeness of their content to the Common Neurolinguistic Map are, and will be, highly successful. Google is a prime example.

These search engines are considered to be good simply because they provide a search result which most closely reflects the human Common Neurolinguistic Map for a certain topic.

Example

Human Semantic Environment :

Englishmen living in the UK today

Topic:

“Manchester United”

In order to rank all the websites on the internet that mention the term “Manchester United” in their content you would have to do the following:

STEP 1

 Find out what the Common Neurolinguistic Map is for the topic “Manchester United” for all the people living in the UK.

STEP 2

 Compare the content of all the websites that mention the term “Manchester United” in their content with the content of the Common Neurolinguistic Map of step 1 and rank them accordingly.

These seem like two easy steps but, as explained below, they are far from easy.

EXPLANATION OF STEP 1:

The theoretical approach:

Theoretically, you would have to interview everybody in the UK, ask them to speak freely for about 10 minutes on the topic “Manchester United”, summarize all the individual interviews and make a final summary of the, say, top 200 words and word combinations they have used.

This would result in a good ‘boil down' of the Common Neurolinguistic Map for the topic “Manchester United” for Englishmen living in the UK today.

To follow such a theoretical procedure for all the topics on which you want to rank websites would be next to impossible, of course, not to mention time-consuming and expensive.

The practical approach of the intelligent search engines: the ContentDNA

Statement:

“There is a reflection of the Common Neurolinguistic Map hidden within the content of all the websites writing about the topic ‘Manchester United’.”

The difficulty is finding a system to extract this content matching the Common Neurolinguistic Map – which we will call the ideal content or ContentDNA – from the thousands and thousands of websites writing about the topic “Manchester United”.

Definition:

The ContentDNA of a certain topic is the content extracted from all websites writing about this topic that most closely matches the content of the Common Neurolinguistic Map .

EXPLANATION OF STEP 2:

Once the search engines have succeeded in determining what the ideal content or ContentDNA for a search term or topic should look like, they can compare the content of every website on the internet with the ContentDNA and rank them accordingly.

You will realize that the real difficulty here lies not in comparing and ranking the websites (there are numerous techniques and scripts available to do this accurately) but in finding a smart way of determining the ideal content or ContentDNA.
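
As an illustration of the comparing and ranking in step 2, the sketch below scores pages against a ContentDNA using cosine similarity. Cosine similarity is one common comparison technique, chosen here as an assumption; the paper does not say which technique search engines use. The ContentDNA weights, page names and page term weights are invented purely for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_pages(content_dna, pages):
    """Order page names by how closely their term weights match the ContentDNA."""
    scores = {name: cosine_similarity(content_dna, terms) for name, terms in pages.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Invented toy data, for illustration only.
content_dna = {"football": 3.0, "premier league": 2.0, "old trafford": 1.0}
pages = {
    "page_a": {"football": 2.0, "premier league": 1.0, "old trafford": 1.0},
    "page_b": {"football": 1.0, "tickets": 4.0},
    "page_c": {"hotel": 2.0, "manchester": 1.0},
}

print(rank_pages(content_dna, pages))
```

The page whose terms overlap most with the ContentDNA comes first; a page sharing no terms with it scores zero.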

At present, Google is by far the most intelligent of the search engines. Based on the accuracy of their search results, we can be sure that Google has developed a technology based on Latent Semantic Analysis in order to find out what the ContentDNA is for every topic.

 

     5.3. The Search Engine Rank Position (SERP) is based on the match with the ContentDNA.
 

How the total algorithm of the major search engines (like Google) works is not public knowledge and probably never will be.

We can, however, state the following:

Statement:

“All intelligent search engines base their Search Engine Rank Position or SERP for a given topic largely on the degree to which the content of the ranked websites matches the ContentDNA.”

Explanation:

According to the statement, content is “largely” responsible for the SERP ranking. There are of course many other factors that also influence ranking, for example:

•  the Page Rank of the websites

•  the Click Through Rate of the websites

•  whether the websites comply with W3C standards

•  …

These factors belong more to the realm of general SEO internet discussion and are beyond the scope of this paper.

Whether content is responsible for 50%, 40% or 60% of the final ranking of a website is a point of discussion among many SEO experts and also beyond the scope of this paper. In any case, the proposition that the match with the ideal content or ContentDNA makes up a large part of the ranking process holds.

    6. Search engines use models based on Latent Semantic Analysis to extract ContentDNA
 

    

6.1. Basics of Latent Semantic Analysis

 

Latent Semantic Analysis, or LSA, is about putting every word into a space model, where the vectors represent the relevancy between the ‘search term' and the surrounding content.

We will illustrate the basic operation of Latent Semantic Analysis with an example:

Suppose we want to rank all the websites on the internet for the search term “wine”. Therefore we need to find out what the ContentDNA for the topic ‘Wine' is.

Topic:

‘Wine'

Limitation of the example:

Let's pretend, for simplicity, that the internet contains only three websites that mention the topic “wine”, that there is only one search engine environment (e.g. www.Google.com), and that everybody speaks English.

W1: Wine is made of grapes!

W2: “ Bordeaux is a good wine”

W3: - Bordeaux wine- is made of grapes

Step 1: Parsing the content of the websites.

The process of parsing is a science in itself and a full explanation is beyond the scope of this paper. We can, however, summarize it as the following basic steps:

•  Filter all the HTML and other code to retrieve the pure content
•  Remove all non value symbols, such as: (“{% etc.
•  Filter out all the ‘stop words', such as: the, I, you, we, etc.
•  Remove all the layout elements, borders, lines, etc.

The result of the parsing process should be a clean text without stop words. It is of course important to realize that parsing is language-related when you remove the stop words.

Note: For more information on parsing we refer to:

The content after parsing:

W1: Wine grapes

W2: Bordeaux good wine

W3: Bordeaux wine grapes
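
The parsing of the three example websites can be sketched as follows. This is a minimal illustration and not a real parser: the stop-word list is an assumption, chosen so that the output matches the parsed content above, HTML filtering is left out because the example texts contain no code, and the regular expression only handles unaccented ASCII letters. Words are lowercased for matching.

```python
import re

# Assumed stop-word list; "made" is included because the example drops it.
STOP_WORDS = {"is", "a", "made", "of", "the", "i", "you", "we"}


def parse(raw_text):
    """Strip symbols, lowercase the text and drop stop words."""
    words = re.findall(r"[a-z]+", raw_text.lower())
    return [word for word in words if word not in STOP_WORDS]


websites = {
    "W1": "Wine is made of grapes!",
    "W2": "“ Bordeaux is a good wine”",
    "W3": "- Bordeaux wine- is made of grapes",
}

parsed = {name: parse(text) for name, text in websites.items()}
for name, words in parsed.items():
    print(name, " ".join(words))
```

Because the stop-word list is specific to English, a different list would be needed for every other language, which is the language dependence mentioned above.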

Step 2: Index and inverted index in the lexicon

Indexing and inverted indexing is a complex process requiring specific knowledge of how to operate large content databases. The basics, however, can be simply explained: once the content is parsed, it is ready to be indexed against the lexicon. The lexicon is like a large dictionary, containing all the words that have ever been used on the internet in different languages.

Indexing involves replacing every word of each website with its position in the lexicon:

Let's assume that the lexicon for the English language only contains the following words:


1 grapes
2 Bordeaux
3 good
4 wine

Index of the websites in our example:

W1:     4 _ 1

W2:     2 _ 3 _ 4

W3:     2 _ 4 _ 1

In order to be able to index the websites according to a certain search term or topic you need to invert the index. The inverted index starts from the lexicon and marks which words occur in which website:

Inverted index on the lexicon for the example:

                 W1    W2    W3

1 grapes         1     0     1

2 Bordeaux       0     1     1

3 good           0     1     0

4 wine           1     1     1

 

A detailed explanation of how indexing and inverted indexing works is beyond the scope of this paper.
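
For the small example, the index and the inverted index can nevertheless be sketched in a few lines. The function names `build_index` and `build_inverted_index` are illustrative, and plain dictionaries stand in for the large content databases a real search engine would need.

```python
# Lexicon of the example: word -> position in the lexicon.
LEXICON = {"grapes": 1, "bordeaux": 2, "good": 3, "wine": 4}

# Parsed content of the three example websites (lowercased).
PARSED = {
    "W1": ["wine", "grapes"],
    "W2": ["bordeaux", "good", "wine"],
    "W3": ["bordeaux", "wine", "grapes"],
}


def build_index(parsed, lexicon):
    """Index: website -> lexicon positions of its words, in order of appearance."""
    return {site: [lexicon[word] for word in words] for site, words in parsed.items()}


def build_inverted_index(parsed, lexicon):
    """Inverted index: lexicon word -> sorted list of websites that contain it."""
    inverted = {word: [] for word in lexicon}
    for site, words in parsed.items():
        for word in set(words):
            inverted[word].append(site)
    return {word: sorted(sites) for word, sites in inverted.items()}


index = build_index(PARSED, LEXICON)
inverted_index = build_inverted_index(PARSED, LEXICON)
print(index)
print(inverted_index)
```

The index answers "which words does this website contain?", while the inverted index answers "which websites contain this word?", which is the question a search term poses.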

Step 3: Term Space model of the lexicon

The search term is at the origin and the axes represent the surrounding content.

0,0,0 point:     represents     ‘wine'
X axis:          represents     ‘grapes'
Y axis:          represents     ‘Bordeaux'
Z axis:          represents     ‘good'

 

Because we are looking for the correlation between the topic and the complete lexicon, we put the topic on the 0 point and all the words of the lexicon on the different axes.

Term Space model (this is the same as the inverted index)

The Term Space model is a visualisation of the inverted index based on Latent Semantic Analysis.

The vector representing the relevancy between ‘wine' and the content of website Nr. 1 is (1,0,0)

The vector representing the relevancy between ‘wine' and the content of website Nr. 2 is (0,1,1)

The vector representing the relevancy between ‘wine' and the content of website Nr. 3 is (1,1,0)

The total vector of the Term Space model is the sum of the three website vectors:

  (1,0,0)
+ (0,1,1)
+ (1,1,0)
= (2,2,1)

This tells us that the most relevant content based on our small database is ‘Bordeaux' and ‘grapes'. ‘Good' also has a relation to wine, but to a lesser extent.
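
The construction of the website vectors and their sum can be sketched as below. This follows the simplified Term Space model of this example only, with one axis per lexicon word other than the search term and a 1 wherever the word occurs; the names `AXES` and `site_vector` are illustrative.

```python
# Axes of the Term Space (X, Y, Z); the search term "wine" sits at the origin.
AXES = ["grapes", "bordeaux", "good"]

# Parsed content of the three example websites (lowercased).
PARSED = {
    "W1": ["wine", "grapes"],
    "W2": ["bordeaux", "good", "wine"],
    "W3": ["bordeaux", "wine", "grapes"],
}


def site_vector(words, axes):
    """1 on an axis if the website contains that word, otherwise 0."""
    return tuple(1 if axis in words else 0 for axis in axes)


vectors = {site: site_vector(words, AXES) for site, words in PARSED.items()}

# Total vector: component-wise sum over all websites.
total = tuple(sum(column) for column in zip(*vectors.values()))

print(vectors)
print(dict(zip(AXES, total)))
```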

The ContentDNA for the topic “wine” for the limited environment and limited lexicon of this example therefore consists of ‘grapes' (2), ‘Bordeaux' (2) and ‘good' (1).

This representation of the ideal content or ContentDNA is limited because we are abstracting from a limited model.

The ContentDNA from the Latent Semantic Analysis model is based on the Term Space model. This model is limited but can serve as a basis for the further models based on Latent Semantic Analysis and Latent Semantic Optimization.

 

    6.2. Limitations of using Latent Semantic Analysis to extract the ContentDNA
 

The following limitations show that the use of Latent Semantic Analysis is limited as a means for approaching the ideal content or ContentDNA.

•  Term weighting: So far we have assumed all words in the content of the website to be equally important. This means that we have not, for example, taken into consideration the following term-weighting criteria:

     - The position of the words in the website.

     - The syntax of the words, for example bold, italic, underlined, etc.

     - The absolute count of how many times the words are used in the website.

     - …

•  Website weighting: We have assumed that these three websites can be considered as being equally well-informed on the topic “wine”. In reality the relative importance of the websites will be evaluated. This relative importance will be influenced by other websites and surfer activity. The relative importance of the websites or the “website weighting” can be measured by, for example:

     - The Page Rank, or PR: a tool developed by Google to measure the number and the weight of the backlinks pointing to the website.

     - The “Click Through Rate”, or CTR: a measurement used by all major search engines to indicate how attractive a specific webpage is to the internet surfer. A high CTR means that surfers who are shown the webpage in the results tend to click through to it, and hence find it relatively attractive; a low CTR means they tend to pass it over.

     - The topic concentration: another measurement used by all major search engines, which shows the relative importance of the topic compared to the total content of the website. A website of four pages of which three are on the topic “wine” will have a higher “topic concentration” than a 150-page website with three of the pages on the topic “wine”.

     - …

•  Term-Term relations: The theoretical model of Latent Semantic Analysis breaks the content down to single words. This means all word-to-word relations are ignored. Only the topic-word relations are analyzed.

ContentDNA which does take into consideration term-term relations will give a closer approximation of the Common Neurolinguistic Map than ContentDNA which does not.

•  The syntax of words: Because LSA ignores term-term relations, the syntax of the word combinations is also not considered. Again, ContentDNA that does take into consideration the syntax of words and word combinations will result in a closer approximation of the Common Neurolinguistic Map.

•  Complexity of the model: In our example we used a Term Space consisting of only three axes. Imagine the complication involved in using one with 65 million axes (the number of words in the Google lexicon).

 

This list of limitations illustrates how, although the theoretical approach could be considered to be the ‘right’ way to approach the ideal content or ContentDNA, the reality is far more complicated – and requires an adjustment of the theoretical LSA model.

 

    6.3. Latent Semantic Indexing or LSI as an evolution on LSA with the incorporation of term-term relations
 

Were Google or any other smart search engine to use LSA according to the strict theoretical model as described above, they would only be able to measure relations between a search term and single words.

The human brain, or more accurately our Semantic Neural Network, is not at all based on a one-to-one matrix relation of single words. The Semantic Neural Network can be visualized as a complex spider web, with words at the points where the threads join.

To explain the relative importance of term-term relations we will use an example:

 6.3.1. Example to explain the limitation of LSA


Topic:

  “sports cars”

Human Semantic Environment:

 English speaking Americans living in North America today.

The result of the LSA indexing:

Term LSA index
Ferrari 0.89
Porsche 0.82
speed 0.67
tuning 0.14
radio 0.12
exhaust 0.11
CDs 0.02
red 0.01
mufflers 0.01


The ContentDNA based on LSA for the term “sports cars” looks like this:

Latent Semantic Analysis can be used to generate the ContentDNA for a certain search term. However this model is limited as the human Semantic Neural Network does not use a matrix of one-to-one relations between single words but rather one comprising many-to-many relations. The LSI model of the second generation is a better model for generating ContentDNA. It uses a process called Latent Semantic Optimization, as will be explained further on.


Should we decide to use the top six terms from the ContentDNA, they would be:


1 Ferrari
2 Porsche
3 speed
4 tuning
5 radio
6 exhaust
 

CDs, red, and mufflers have a relatively low LSA score.
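The selection of the top six terms can be sketched in a few lines of Python. This is only an illustration of the ranking step: the scores are copied from the example table above, and the function name `top_terms` is our own.

```python
# LSA scores copied from the example table for the topic "sports cars".
lsa_scores = {
    "Ferrari": 0.89,
    "Porsche": 0.82,
    "speed": 0.67,
    "tuning": 0.14,
    "radio": 0.12,
    "exhaust": 0.11,
    "CDs": 0.02,
    "red": 0.01,
    "mufflers": 0.01,
}


def top_terms(scores, n):
    """Return the n highest-scoring terms, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [term for term, _ in ranked[:n]]


print(top_terms(lsa_scores, 6))
```

The cut-off of six is arbitrary; any other number of terms, or a minimum score, could be used in the same way.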



 6.3.2. Visual example of Human Conceptual Knowledge showing the limitation of LSA.

To illustrate the limitation of the LSA model, we will approach this example from another angle:

In order to understand how Human Conceptual Knowledge is formed around the topic “sports cars”, imagine you are sitting in a car inside your brain, driving along the neural highways of your own Semantic Neural Network.

You have arrived in a big city called “Sports Cars”. Cruising the ring road of this city you notice different ways leading out of town:

There are three highways: one going to another city called “Ferrari”, one going to the city of “Porsche” and one going to a major city called “Speed”. Signs also indicate that some of these highways continue on to other major cities: the highway going to “Ferrari” continues to the city called “Red” and the highway to “Porsche” seems to go all the way to “Germany”. There are no highways continuing from the city “Speed” however. There are certainly normal roads going out of “Speed” but these are not indicated on the highway signs.

Staying on the ring road you notice there are also some normal roads going to a few smaller nearby cities: “Tuning”, “Radio” and “Exhaust”. It is also indicated that from these small cities you can drive on to reach “Parts”, “Amplifier”, “CD player” and other towns.

Finally there are some small dirt roads going to the cities “CDs”, “Red” and “Mufflers”.

Thus far the roadmap of our neural brain looks like this:


This drawing shows that a tool using Latent Semantic Optimization or LSO could give us a visual overview of the Human Conceptual Knowledge for a given topic. Any application based on Latent Semantic Optimization would generate a ContentDNA which can be used to reconstruct the human Semantic Neural Network.


Explanation of this visual example:

Our neural brain associates “sports cars” with the single word “Porsche” as we see from the LSA model above. The LSA score for “Germany” is so low that this score does not appear in the ContentDNA based on LSA.

However our Human Conceptual Knowledge automatically associates “Porsche” with “Germany”, simply because Porsches are made in Germany.

For this reason our neural brain also associates “sports car” with the combination “Porsche Germany” (because there is a neural highway continuing from Porsche to Germany).

Let us consider the term “red”: our neural brain has no significant association between “sports car” and “red”, simply because there are so many colours you could associate with a sports car, and colours are associated with so many other things besides “sports cars”. So within the ContentDNA based on LSA, the relative score of “red” will be negligible.

However, when we take the neural highway from “Sports Cars” to “Ferrari” there is a clear highway continuing on to “Red”, simply because most Ferraris are painted red. Again, because there is a continuation of the neural highway, we do also associate “sports car” with “Ferrari red” although we have no significant association between “sports cars” and “red”.

When using the theoretical LSA model, the words “Porsche”, “Ferrari”, “speed”, etc. will have a score in the Term Space model but the combination “Ferrari red” will not appear, so to solve this problem we will have to extend the model to include term-term combinations.

Indeed we could further extend it to include three word combinations, for example: radio – CD player – CDs
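To make the idea of term-term combinations concrete, the road map can be written down as a small graph, and every route of up to three cities leaving “Sports Cars” can be listed as a candidate word combination. The sketch below is only an illustration: the roads are the ones named in this example, the function name `combinations_from` is our own, and no scores are assigned at this stage.

```python
# Roads out of each "city", taken from the sports-cars example in the text.
roads = {
    "sports cars": ["Ferrari", "Porsche", "speed", "tuning", "radio",
                    "exhaust", "CDs", "red", "mufflers"],
    "Ferrari": ["red"],
    "Porsche": ["Germany"],
    "tuning": ["parts"],
    "radio": ["amplifier", "CD player"],
    "CD player": ["CDs"],
    "exhaust": ["systems", "mufflers"],
}


def combinations_from(start, max_words):
    """List every route of 1..max_words cities leaving `start` as a term combination."""
    found = []

    def walk(city, route):
        for next_city in roads.get(city, []):
            new_route = route + [next_city]
            found.append(" ".join(new_route))
            if len(new_route) < max_words:
                walk(next_city, new_route)

    walk(start, [])
    return found


terms = combinations_from("sports cars", 3)
print(terms)
```

Single words such as “red” and combinations such as “Ferrari red” now appear as separate entries, which is exactly what the pure LSA matrix of single words cannot express.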


 6.3.3. Adjusted ContentDNA based on Latent Semantic Indexing using term-term relations

When you implement Latent Semantic Indexing taking into consideration term-term relations it will result in a ContentDNA which more closely approaches the Common Neurolinguistic Map than ContentDNA that ignores term-term relations.

Consider our visual example of human conceptual knowledge – to try to quantify the cities around the central city of “Sports Cars” you could give each a score based on their accessibility. We could determine accessibility using the following parameters:

•  The type of road/highway which goes to the nearby city: a larger road will allow us to drive faster and will also allow more traffic. For example, the highway going from “Sports Cars” to “Ferrari” will be a lot faster than the road going to “CDs”.

•  The distance between the cities: it takes less time to travel to cities that are located very close to “Sports Cars” (such as “Radio”) than to those further away (such as “Exhaust”).

•  Whether direct or connected accessibility: if you have a direct road or highway going to the city you want to reach, this will be faster than having to travel via another city. “Radio”, for example, is directly accessible, but to go to “Parts” you have to pass through “Tuning”.

Using these parameters you could score each road in terms of traffic capability, taking into consideration not only the cities that are directly connected to “Sports Cars”, but also those situated two or three cities away.

This might then be the ContentDNA of our visual example:


Term LSI based on term-term relations
Ferrari 0.89
Porsche 0.82
Porsche Germany 0.71
speed 0.67
Ferrari red 0.62
tuning parts 0.53
exhaust systems 0.51
exhaust mufflers 0.45
radio amplifier 0.42
radio CD player CDs 0.35
tuning 0.14
radio 0.12
exhaust 0.11
CDs 0.02
red 0.01
mufflers 0.01

 


This is how the ContentDNA would look:


The model based on Latent Semantic Indexing using term-term relations gives an improved ContentDNA – one more closely matching the human Semantic Neural Network for a given search term.

Should we decide to take the top six terms as the ideal content, this is how the ContentDNA would look:

1 Ferrari
2 Porsche
3 Porsche Germany
4 speed
5 Ferrari red
6 tuning parts


This example shows clearly the necessity of extending LSA with term-term relations. Since LSA does not offer the possibility of measuring relations between words and word combinations – because the matrix structure only contains single words of the lexicon – a lot of relevant information is missed.

Furthermore, LSA makes a total abstraction of syntax. It only works with single words, which can be in random order – the word order is given no importance.

Because pure LSA can only work with single words, and neglects syntax completely, there is a need for a model that can give a ContentDNA which better approximates the human brain, or the Semantic Neural Network.

 
    


6.4. Latent Semantic Indexing model of the second generation

 

 6.4.1. LSI model of the second generation developed by the major search engines.

Because of the discrepancy between theoretical LSA and how the human brain works in reality, we assume that Google, Yahoo and perhaps other major search engines have evolved the LSI model as we know it into one that better reflects the human brain.

We call this the Latent Semantic Indexing model of the second generation.


Theory:

Assume a search engine working with a lexicon of 65 million words. The total number of possible two- and three-word combinations, including the single words, would be extremely high.

As a result, even a search engine with vast storage capacities, extremely fast processors and instantaneous data communication would find this number of combinations far too big to handle within a theoretical LSA/LSI model. Even Google could not manage it.
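A quick calculation gives a feeling for the size of the problem. The sketch below assumes that word order matters (since syntax matters) and that a word may be repeated, so that every ordered pair and triple counts as a separate combination; other counting conventions would give somewhat smaller numbers of the same order of magnitude.

```python
lexicon_size = 65_000_000  # single words, the figure used in the text

single = lexicon_size
double = lexicon_size ** 2  # every ordered pair of words
triple = lexicon_size ** 3  # every ordered triple of words
total = single + double + triple

print(f"two-word combinations:   {double:.3e}")
print(f"three-word combinations: {triple:.3e}")
print(f"all terms together:      {total:.3e}")
```

The three-word combinations alone run to more than 10^23 entries, and each of them would need its own axis in the Term Space.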

Therefore all major search engines have developed models that do not perform LSI on all two- and three-word combinations. Recognizing the impossibility, they instead work with a limited selection of double and triple words.

To determine which should be analyzed and which not, all major search engines use a very evolved syntax filter.

Below we describe how such a syntax filter works, based on our knowledge and testing of the Google database. This necessarily involves some assumption – we do not claim to know the exact details of how Google’s syntax filter works – but we are sure that this corresponds closely with the reality.


 6.4.2. Assumptions of a Content Syntax Filter

As already mentioned, the syntax – the way sentences are formed by the position and order of the words – matters in semantic analysis and hence in the determination of ContentDNA.

Some very complex programs have been developed, based on our current knowledge of Artificial Intelligence, in order to enable computers or robots to understand human languages.

These programs are still experimental, however, and a long way from being ready for use by large-scale search engines that require reliable stable processes.

Fortunately for all the major search engines they don’t in fact need this technology: the syntax of a given language for a specific Search Environment is actually hidden in the content of the database itself.

In other words, because of the vast amount of content loaded daily into their databases, search engines can easily determine which words and word combinations are used most frequently and, by extension, which combinations are semantically correct.

We call this the Content Syntax Index (CSI).


 6.4.3. The Content Syntax Index (CSI)

As explained earlier, we consider the content of the global internet to be a reflection of the Common Neurolinguistic Map. But because people, countries, regions, languages etc. differ across the globe, working with the global internet as a reflection of the Common Neurolinguistic Map makes no sense.

We have already concluded that we have to split up the Global internet by country (or state) and language in order to reflect the Common Neurolinguistic Map for a specific country and language.

Example:


Google Database Language selection Common Neurolinguistic Map
     
www.google.co.uk English English-speaking Englishman
www.google.ca English English-speaking Canadian
www.google.ca French French-speaking Canadian
www.google.fr French French-speaking Frenchman
www.google.de German German-speaking German
www.google.co.za English English-speaking South African
www.google.co.za Afrikaans Afrikaans-speaking South African

We will use the first setup as example:

 Search Environment:

Google Database: www.google.co.uk
Language selection browser: English
Default language selection search engine: English
Surfer location based on IP address: UK

 

This search environment should reflect the Common Neurolinguistic Map of all the English-speaking people in the UK: the Englishman.

We assume that the majority of the websites, or more specifically the content of those websites in the database of www.google.co.uk, is written according to the common syntax rules of the Englishman.

A multitude of poor quality or spam-laden websites would degrade this statement, of course. However this is overcome by Google in their ranking tools: ‘crappy’, badly-written websites with spelling mistakes and faulty HTML will have less impact on the Content Syntax Index than websites free of these mistakes.

By using the content in their database, search engines like Google can first weight the semantic accuracy of the double and triple words. Secondly, they can make a selection of the most accurate two- and three-word combinations.

This limited number of combinations is used in the LSI model, resulting in the LSI model of the second generation.
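The two actions described above, weighting the double and triple words and then selecting the most accurate ones, could be sketched as follows. This is a minimal illustration under our own assumptions, not a description of how any search engine actually works: the quality factor per document, the threshold `min_count` and the toy corpus are all invented.

```python
from collections import Counter


def ngrams(words, n):
    """Yield every run of n consecutive words as one string."""
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])


def content_syntax_index(documents, min_count=2):
    """Count 2- and 3-word combinations, weighted by a per-document quality
    factor, and keep those whose weighted count reaches min_count."""
    counts = Counter()
    for text, quality in documents:
        words = text.lower().split()
        for n in (2, 3):
            for gram in ngrams(words, n):
                counts[gram] += quality
    return {gram: count for gram, count in counts.items() if count >= min_count}


# Toy corpus of (text, quality weight) pairs. The weights are invented.
corpus = [
    ("Manchester United won the game", 1.0),
    ("Manchester United lost the game", 1.0),
    ("game the won united manchester", 0.2),  # low-quality page, scrambled word order
]

csi = content_syntax_index(corpus)
print(sorted(csi))
```

In this toy corpus only the combinations that recur in the well-written pages survive the selection; the scrambled combinations of the low-quality page carry too little weight to be kept.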

 
    


6.5. Latent Semantic Indexing model of the second generation to extract the ContentDNA


The LSI model of the second generation follows a three-step process as explained below. The result of this process is the ideal content or ContentDNA.


 6.5.1. First step of the LSI model of the second generation

We will explain the first step with an example:


Search term: Manchester

Search Environment:

 
Google Database: www.google.co.uk
Language selection browser: English
Default language selection search engine: English
Surfer location based on IP address: UK

The first step consists of determining the semantic relevance of the double and triple words. The Content Syntax Index (CSI), as explained above, is used to decide which double and triple words are relevant enough to be used in the LSI model.

Example of double and triple word combinations that are semantically correct and can be used in the LSI model:


Manchester United
Fair play
Old town
Winning game
World cup finals
Football is cool
Cool sport
David Beckham
Sports news
Good team spirit
Ball game
Big city
City town hall

The LSI score of these terms has not yet been decided, so we cannot yet say whether or not they will make up a part of the ContentDNA. The only thing we can conclude at this stage is that these double and triple word combinations are semantically correct and that they potentially make up part of the ContentDNA for the search term “Manchester” for the defined search environment.


 6.5.2. Second step of the LSI model of the second generation

We assume that all of today’s major search engines are using an evolved model of LSI in order to determine LSI scores. The LSI model of the second generation makes a matrix of all the single words of the lexicon as well as the double and triple word combinations derived from the first step.

This group of words and word combinations is processed in the LSI model of the second generation.

The result for our example could look like this:


Term LSI score
   
United 0.78
Football 0.71
Game 0.68
World cup finals 0.67
Man United 0.66
Town 0.12
UK 0.11
David Beckham 0.9
News 0.03
Milan 0.25


This shows that there is a big content-relevant association between ‘Manchester’ and ‘United’ and between ‘Manchester’ and ‘Football’.

Unsurprisingly, the word combination “Man United” also has a big correlation with “Manchester”.

There is a lower correlation with ‘News’ and ‘Milan’ but these words still make up a part of the ContentDNA. The double words “Big city” and “old town” do not make it into the ContentDNA, although they are semantically correct combinations.
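The matrix of the second step can be illustrated with a strongly simplified sketch. It builds a term-document matrix in which the approved word combinations from the first step appear as rows next to the single words, and it scores each term by the cosine between its row and the row of the search term. Note that this is plain vector-space similarity: a real LSI model would first reduce the matrix with a singular value decomposition, a step that is left out here. The documents, the approved combinations and all function names are invented for the illustration.

```python
import math


def tokenize(text, combos):
    """Split text into single words and add every approved word combination it contains."""
    words = text.lower().split()
    terms = list(words)
    for n in (2, 3):
        for i in range(len(words) - n + 1):
            gram = " ".join(words[i:i + n])
            if gram in combos:
                terms.append(gram)
    return terms


def cosine(u, v):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def relatedness(documents, combos, search_term):
    """Score every term by the cosine between its row and the search term's row."""
    doc_terms = [tokenize(doc, combos) for doc in documents]
    vocabulary = sorted({term for terms in doc_terms for term in terms})
    rows = {term: [terms.count(term) for terms in doc_terms] for term in vocabulary}
    target = rows[search_term]
    return {term: cosine(rows[term], target) for term in vocabulary if term != search_term}


documents = [
    "manchester united football game",
    "manchester united won the football game",
    "manchester town hall news",
    "milan football game",
]
approved_combos = {"manchester united", "town hall"}

scores = relatedness(documents, approved_combos, "manchester")
for term, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{term:<20}{score:.2f}")
```

Even in this small example the combination “manchester united” receives its own score, independent of the scores of the single words it is made of.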


6.5.3. Third step of the LSI model of the second generation

The third step of the LSI process involves a correction based on the local and global term weighting.

The exact explanation of how this works is beyond the scope of this paper and would also be hypothetical – since the major search engines do not of course reveal how they perform their term weighting. What we can state in this paper is that there is a need to filter out the ‘on topic terms’ from the ‘general terms’. This is done through “local term weighting” and “global term weighting”.

Local Term Weighting:

•  The number of times a single word appears in a document: words that appear several times in the same document are most likely more important than words that appear only once or twice. This is often referred to as ‘local weighting’. This is also adjusted with the location of the word in the document and its syntax value: taking into account, for example, whether the word appears in a title, first paragraph or last sentence, etc.

Global Term Weighting

•  The number of times a single word appears in all of the documents of a search environment. Words that are used frequently in a lot of documents are probably more general terms and less important than words that are only used a couple of times in a couple of websites that tend to be more specific and topic related. This is often referred to as Global Term Weighting.

A further explanation of how Local Term Weighting and Global Term Weighting work is beyond the scope of this paper. However, we can state that this process is used to separate the words that are ‘on topic terms’ from those that are ‘general terms’.
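Although the exact weighting functions of the search engines are unknown, the principle can be illustrated with one common textbook choice: a logarithmic term frequency for the local weighting and an inverse document frequency for the global weighting. The sketch below is an assumption made for illustration only. It ignores the location and syntax adjustments mentioned above, and the function names and toy documents are our own.

```python
import math


def local_weight(term, document_words):
    """Local weighting: grows with the number of times the term appears in one document."""
    return math.log(1 + document_words.count(term))


def global_weight(term, all_documents):
    """Global weighting: shrinks as the term appears in more of the documents."""
    containing = sum(1 for words in all_documents if term in words)
    return math.log(len(all_documents) / containing) if containing else 0.0


docs = [text.split() for text in (
    "the united team won the game",
    "the town hall is in the town",
    "the news of the day",
)]

print(local_weight("town", docs[1]), local_weight("hall", docs[1]))
print(global_weight("the", docs), global_weight("united", docs))
```

A general term such as “the”, which appears in every document, receives a global weight of zero, while a topic-related term such as “united” keeps a positive weight.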

In the third step of the LSI model of the second generation, the scores in the list from step 2 are corrected with the local and global weighting.

The result for our example could look like this:


Term              LSI score  Local weighting  Global weighting  Total
United            0.78       0.90             0.89              0.62
Football          0.71       0.89             0.54              0.34
Game              0.68       0.70             0.35              0.17
World cup finals  0.67       0.40             0.25              0.07
Man United        0.66       0.65             0.80              0.34
Town              0.12       0.30             0.30              0.01
UK                0.11       0.85             0.50              0.05
David Beckham     0.90       0.56             0.23              0.12
News              0.15       0.20             0.05              0.00
Milan             0.25       0.28             0.30              0.02

Corrected result with global and local weighting:

Term              Total
United            0.62
Man United        0.34
Football          0.34
Game              0.17
David Beckham     0.12
World cup finals  0.07
UK                0.05
Milan             0.02
Town              0.01
News              0.00
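The text does not state how the Total column is calculated, but the figures in the example table are consistent with a simple product of the three columns, rounded to two decimals. The sketch below reproduces the table under that assumption; the function name `total_score` is our own.

```python
# (LSI score, local weighting, global weighting) copied from the example table.
weights = {
    "United": (0.78, 0.90, 0.89),
    "Football": (0.71, 0.89, 0.54),
    "Game": (0.68, 0.70, 0.35),
    "World cup finals": (0.67, 0.40, 0.25),
    "Man United": (0.66, 0.65, 0.80),
    "Town": (0.12, 0.30, 0.30),
    "UK": (0.11, 0.85, 0.50),
    "David Beckham": (0.90, 0.56, 0.23),
    "News": (0.15, 0.20, 0.05),
    "Milan": (0.25, 0.28, 0.30),
}


def total_score(lsi, local, global_):
    """Assumed correction: the product of the three scores, rounded to two decimals."""
    return round(lsi * local * global_, 2)


totals = {term: total_score(*scores) for term, scores in weights.items()}
ranking = sorted(totals.items(), key=lambda item: item[1], reverse=True)

for term, total in ranking:
    print(f"{term:<18}{total:.2f}")
```

“Football” and “Man United” end up with the same rounded total, so their relative order in the corrected list is a matter of convention.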

You can see that “Man United” receives a higher position in the ContentDNA owing to a relatively good local and global term weighting. Conversely, “David Beckham”, which has the highest LSI score after step 2, drops to fifth position because of its low global term weighting.

Together, these three steps of the LSI model of the second generation produce the final ContentDNA, as it is used by all major search engines to determine the ideal content for a search term in a certain search environment.

 
    7. The use of Latent Semantic Optimization to approach the original ContentDNA from the search engines as closely as possible – a theoretical model
 

        7.1. Purpose of approaching the ContentDNA using Latent Semantic Optimization or LSO.

Basically, Latent Semantic Optimization or LSO will attempt to approach the ContentDNA for a certain search term or keyword as closely as possible. The purpose of this approach is to recover the ideal content or ContentDNA for this search term.

When you know what the ContentDNA should look like you can write the content of your webpage based on this content.

This will result in a higher ranking from the search engines because the content of your webpage is a good match for the ideal content or ContentDNA.

This means that Latent Semantic Optimization, or LSO, can be – and will be – applied in the near future for Search Engine Optimization, or SEO.

The closer the content of your webpage matches the ContentDNA, the higher the search engine concerned will rank your webpage on the content score for the specific search term.


Example:



Original ContentDNA

The ContentDNA can be visually presented as a graph with, on the horizontal axis, the total lexicon for a certain language: e.g. English, extended with the double and triple word combinations selected with the Content Syntax Index (CSI).

On the vertical axis is the score of the words and word combinations based on the Latent Semantic Indexing model of the second generation

This is a visual example:

The original ContentDNA as it is calculated by all major search engines using a technique based on Latent Semantic Indexing model of the second generation. This model results in the Original ContentDNA for a given search term in a given search environment.

Webpage with a good approximation of the Original ContentDNA

To visualize the ContentDNA of an existing webpage you would start with the horizontal axis of the Original ContentDNA as described above and score every word and double or triple word combination of the webpage according to the relative importance of its appearance on the webpage.

The procedure of weighting the words and word combinations on a webpage to arrive at a score of relative importance is not an exact science and is open to interpretation. The weighting is mostly based on the following three parameters:

•  The HTML tags: e.g. H1 title, Bold, Italic, Alt tag, etc.

•  The location of the word in the text: first paragraph, top, bottom, etc.

•  The number of times the word or word combination is used in the text
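As an illustration of how these three parameters could be combined, the sketch below adds up a weight for every occurrence of a term, depending on the HTML tag and on whether the occurrence is in the first paragraph, and then scales the result so that the most important term scores 1.0. All weights are invented, since the text names the parameters but gives no values, and the function name `page_term_scores` is our own.

```python
from collections import defaultdict

# Invented weights per HTML context, for illustration only.
TAG_WEIGHT = {"h1": 3.0, "b": 1.5, "i": 1.5, "alt": 1.2, "p": 1.0}
FIRST_PARAGRAPH_BONUS = 1.5


def page_term_scores(occurrences):
    """occurrences: list of (term, tag, in_first_paragraph) tuples.
    Returns term scores scaled so that the best term scores 1.0."""
    raw = defaultdict(float)
    for term, tag, in_first_paragraph in occurrences:
        weight = TAG_WEIGHT.get(tag, 1.0)
        if in_first_paragraph:
            weight *= FIRST_PARAGRAPH_BONUS
        raw[term] += weight  # every occurrence adds to the term's score
    top = max(raw.values(), default=0.0)
    return {term: score / top for term, score in raw.items()} if top else {}


occurrences = [
    ("united", "h1", False),
    ("united", "p", True),
    ("football", "p", False),
    ("football", "b", False),
    ("town", "p", False),
]

page_scores = page_term_scores(occurrences)
print(page_scores)
```

The resulting scores can be plotted on the same horizontal axis as the Original ContentDNA, which makes the two directly comparable.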

Below you can see the ContentDNA of a webpage example that closely matches the original ContentDNA for a certain search term. The webpage will receive a good score for its content for this specific search term and given search environment

The content of this webpage or website matches the original ContentDNA very closely. As a result this webpage or website will get a good score from the major search engines for its content for a given search term in a given search environment. The close match to the ContentDNA is based on Latent Semantic Optimization


Webpage with a less good approximation of the Original ContentDNA



Below you can see an example of a webpage with a ContentDNA that is less successful at matching the original ContentDNA according to the search engines for a given search term in a given search environment.


This webpage will receive a lower score for its content than the webpage described above.

This webpage gets a lower content score because the ContentDNA does not match the original ContentDNA very closely. Using a tool based on Latent Semantic Optimization, the content of the webpage could be improved – and hence the search engine ranking position or SERP. This is a new application for Search Engine Optimization.
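How closely a webpage matches the Original ContentDNA can be expressed as a single number. The text does not prescribe a measure, so the sketch below assumes the cosine similarity between the two score profiles, a common choice for comparing term vectors. The original ContentDNA uses the top totals from the “Manchester” example; the two pages and their scores are invented, and the function name `match_score` is our own.

```python
import math


def match_score(page_dna, original_dna):
    """Cosine similarity between a page's term scores and the original ContentDNA."""
    terms = set(page_dna) | set(original_dna)
    dot = sum(page_dna.get(t, 0.0) * original_dna.get(t, 0.0) for t in terms)
    norm_page = math.sqrt(sum(v * v for v in page_dna.values()))
    norm_original = math.sqrt(sum(v * v for v in original_dna.values()))
    if not norm_page or not norm_original:
        return 0.0
    return dot / (norm_page * norm_original)


# Top totals from the "Manchester" example.
original = {"United": 0.62, "Man United": 0.34, "Football": 0.34, "Game": 0.17}

# Two hypothetical pages with invented term scores.
good_page = {"United": 0.60, "Man United": 0.30, "Football": 0.40, "Game": 0.10}
poor_page = {"Town": 0.50, "News": 0.40, "Football": 0.10}

print(f"good page: {match_score(good_page, original):.2f}")
print(f"poor page: {match_score(poor_page, original):.2f}")
```

A score close to 1 corresponds to the first webpage described above, and a score close to 0 to the second; improving the content of a page then means moving its score towards 1.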

There are of course many other parameters that help determine the ranking of your webpage for a certain search term. For example: the number of back links, the click through rate, being up to date, etc. It is however commonly agreed by many SEO experts that the content of a webpage determines around 50% of the ranking for all major search engines.

 

        7.2. Theoretical model of Latent Semantic Optimization to approach the ContentDNA

Latent Semantic Optimization uses the Latent Semantic Indexing Model of the Second Generation to backward-extract the ContentDNA for any