How to calculate the similarity between two words/strings.

The string similarity algorithm was developed to satisfy the following requirements:

A true reflection of lexical similarity: strings with small differences should be recognized as being similar. In particular, a significant sub-string overlap should point to a high level of similarity between the strings.
A robustness to changes of word order: two strings that contain the same words, but in a different order, should be recognized as being similar. On the other hand, if one string is just a random anagram of the characters contained in the other, then it should (usually) be recognized as dissimilar.
Language independence: the algorithm should work not only in English, but also in many other languages.

Solution

The similarity is calculated in three steps:

Partition each string into a list of tokens.
Compute the similarity between tokens using a string edit-distance algorithm (extension feature: semantic similarity measurement using the WordNet library).
Compute the similarity between the two token lists.
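The three steps can be sketched roughly as follows. This is a minimal Python illustration, not the C# code from the article: the function names are made up for this example, the token score is a Levenshtein distance normalised by the longer token, and the token lists are compared by averaging each token's best match in the other list. The article may well combine the token scores differently, and the WordNet extension is left out.

```python
import re


def tokenize(text):
    """Step 1: split a string into lowercase tokens of letters and digits."""
    return re.findall(r"[^\W_]+", text.lower())


def edit_distance(a, b):
    """Levenshtein distance, keeping only one previous row of the table."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def token_similarity(a, b):
    """Step 2: turn the edit distance into a score between 0 and 1."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def similarity(s1, s2):
    """Step 3 (assumed scheme): average every token's best match, both ways."""
    tokens1, tokens2 = tokenize(s1), tokenize(s2)
    if not tokens1 or not tokens2:
        return 1.0 if tokens1 == tokens2 else 0.0
    best1 = [max(token_similarity(t, u) for u in tokens2) for t in tokens1]
    best2 = [max(token_similarity(u, t) for t in tokens1) for u in tokens2]
    return (sum(best1) + sum(best2)) / (len(best1) + len(best2))


print(similarity("a_logfile.txt", "logfile_a.txt"))
print(similarity("myText.txt", "logfile.txt"))
```

Because the comparison works on tokens rather than on raw character positions, two strings with the same words in a different order get the top score, which is the word-order requirement listed above.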

There is another discussion for your reference.

A better similarity ranking algorithm for variable length strings

Thanks all for your help and suggestions.

Martin Xie [MSFT] MSDN Community Support | Feedback to us Get or Request Code Sample from Microsoft Please remember to mark the replies as answers if they help and unmark them if they provide no help.

Marked as answer by Martin_Xie Monday, September 26, 2011 8:48 AM

All replies

What is your question? Please explain it more specifically; I am confused by it.

For instance, “a_logfile.txt” and “logfile_a.txt” should be very similar, and so should “loga_file.txt” and “logfile.text”, but not “myText.txt” and “logfile.txt”.

If it solved your problem, please click “Mark As Answer” on that post and “Mark as Helpful”. Happy Programming!

OK, I'll try it again 🙂

Well, I want to compare filenames and get a percentage number for how similar they are. I don't know if this is possible at all.

For example, the filenames “a_filename.txt” and “filename_a.txt” are very similar to us, but how can I get the same result programmatically?

Another example: the filenames “file_abc_.txt” and “fil_abc_e.txt” are also similar, but again, how do I get that result programmatically?
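One way to get such a percentage, shown here only as a rough Python sketch rather than an answer given in this thread, is the standard library's `difflib.SequenceMatcher`, whose `ratio()` is twice the number of matching characters divided by the combined length of both strings. The helper name `percent_similar` is made up for this example.

```python
from difflib import SequenceMatcher


def percent_similar(name1, name2):
    """Return a 0-100 score based on matching character blocks."""
    ratio = SequenceMatcher(None, name1.lower(), name2.lower()).ratio()
    return round(ratio * 100, 1)


pairs = [
    ("a_filename.txt", "filename_a.txt"),
    ("file_abc_.txt", "fil_abc_e.txt"),
    ("myText.txt", "logfile.txt"),
]
for left, right in pairs:
    print(f"{left!r} vs {right!r}: {percent_similar(left, right)}%")
```

This measure rewards long shared runs of characters, so moved fragments such as “a_” still leave a high score, but it does not treat word order as irrelevant the way a token-based comparison does.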

That is potentially more challenging than it seems at first.

Have a look at http://en.wikipedia.org/wiki/String_metrics and follow some of the hyperlinks.

Regards, David R
Every program eventually becomes rococo, and then rubble. - Alan Perlis
The only valid measurement of code quality: WTFs/minute.

Welcome to the MSDN Forum.

This article shows a solution to the question of how to compute the similarity between two words/strings. The algorithm was developed in C# and you can download the demo from it.


I have written some code for my project to find similar names in a database.

First I used the DIFFERENCE(string1, string2)>=4 function of SQL Server, but it didn't help me. For example, when the first name was “21” and the second name was “21 Jump Street”, the result contained both names, whereas clearly they are not really similar. The result set of such a query contained over 700 values, which was very poor in this case.

Then I found a similar DIFFERENCE function for C# that was almost the same as the SQL version of that function. For example, it rated the similarity of “asdcdfsdfgdsgdg” and “asdewwetqwetrwe” as the best possible match, which is clearly false.

Then I created a class for this problem to get a more effective similarity between strings.

The name of the class is StringCompare, and here is an overview of it:

WHAT IS STRING COMPARE?

StringCompare is a comparison tool for strings. It is not an ordinal comparison, but a relative comparison that determines how similar or dissimilar two strings are.

By setting good tradeoff values you can get a good comparison for strings.

HOW TO USE:

First you need to create an instance of StringCompare, either with your own tradeoff values or with the default tradeoff values.

There are 4 values that can be set:

1. MinSimilarityLong:

This is the minimum acceptable percentage of similarity between two strings that are compared with StringCompare. This value is used for strings with a length of at least 8.

2. MinSimilarityShort:

This is the minimum acceptable percentage of similarity between two strings that are compared with StringCompare. This value is used for strings with a length below 8.

3. MaxToleranceLong:

This is the maximum acceptable percentage of tolerance between two strings that are compared with StringCompare. This value is used for strings with a length of at least 8.

4. MaxToleranceShort:

This is the maximum acceptable percentage of tolerance between two strings that are compared with StringCompare. This value is used for strings with a length below 8.

* After you have created an instance you can call InstanceName.IsEqual(string1, string2) to determine the equality of two strings.

* Note that equality is relative to the minSimilarity and maxTolerance values you set earlier.

* Note that higher minSimilarity values will give more restrictive results, and vice versa.

* Note that lower maxTolerance values will give more restrictive results, and vice versa.
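To make the interplay of the four values concrete, here is a hypothetical Python stand-in for the class, not the author's C# code. The post does not say how StringCompare measures similarity or tolerance, what the defaults are, or which string's length selects the long or short thresholds, so the sketch assumes all three: similarity is the share of matching characters from `difflib.SequenceMatcher`, tolerance is the relative difference in length, and the longer string decides which pair of thresholds applies.

```python
from difflib import SequenceMatcher


class StringCompare:
    """Hypothetical Python stand-in for the C# class described above."""

    SHORT_LIMIT = 8  # lengths below this use the "short" thresholds

    def __init__(self, min_similarity_long=70.0, min_similarity_short=50.0,
                 max_tolerance_long=30.0, max_tolerance_short=50.0):
        # Assumed default percentages; the post does not list the real ones.
        self.min_similarity_long = min_similarity_long
        self.min_similarity_short = min_similarity_short
        self.max_tolerance_long = max_tolerance_long
        self.max_tolerance_short = max_tolerance_short

    def is_equal(self, s1, s2):
        """Return True when both the similarity and the tolerance check pass."""
        longest = max(len(s1), len(s2))
        if longest == 0:
            return True  # two empty strings are trivially equal
        # Assumption: similarity is the share of matching characters.
        matcher = SequenceMatcher(None, s1.lower(), s2.lower())
        similarity = 100.0 * matcher.ratio()
        # Assumption: tolerance is the relative difference in length.
        tolerance = 100.0 * abs(len(s1) - len(s2)) / longest
        # Assumption: the longer string decides which thresholds apply.
        if longest >= self.SHORT_LIMIT:
            min_sim, max_tol = self.min_similarity_long, self.max_tolerance_long
        else:
            min_sim, max_tol = self.min_similarity_short, self.max_tolerance_short
        return similarity >= min_sim and tolerance <= max_tol


comparer = StringCompare()
print(comparer.is_equal("21", "21 something much longer"))
print(comparer.is_equal("a_filename.txt", "filename_a.txt"))
```

Under these assumptions a very short name is rejected against a much longer one that merely starts with it, because both the similarity check and the length-based tolerance check fail. Raising the minimum similarity or lowering the maximum tolerance makes `is_equal` stricter.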

For example: