Company name extraction and custom entity extraction with SharePoint search

Introduction

In this blog post I would like to talk about two important SharePoint search features that can greatly increase the findability of documents in SharePoint. When it comes to findability, metadata plays a very important role. Unfortunately, in many scenarios there are no content types or custom columns in which additional metadata for documents is stored. At first glance this looks like a problem, but most of the time the needed metadata can be found in the document itself.

SharePoint Search offers two features, “Company name extraction” and “Custom entity extraction”, that help you pull the needed metadata out of the documents so that you can use it in search queries or to refine search results. Both features extract the metadata from the content of managed properties, so each feature has to be enabled on the managed properties you want to extract from.

Please be aware that these two features are only available in the SharePoint Enterprise edition!

Company name extraction

Company name extraction offers the possibility to extract desired company names from content, for example the body or title of documents.

Before you start using this feature, you should think about the company names that you want to extract and also about the company names that you don’t want to extract. The name of your own company is a good example of a name that should be excluded: it will appear in nearly every document and is therefore not very helpful for refining search results or building search queries.

SharePoint already ships with a prepopulated dictionary that includes a large number of company names (Microsoft and SAP, for example), but you can include additional company names or exclude specific ones with the help of two term sets. You can find them in the term group “Search Dictionaries”.

If you want to include a company name, just create a new term with the desired company name in the term set “Company Inclusions”. In my case I want to include our demo company name “HanseSystems” and our company name “HanseVision”.

If you want to exclude a company name, just create a new term with the desired company name in the term set “Company Exclusions”. In my case I want to exclude the company name “SAP”.
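
If you prefer scripting over the Term Store Management Tool, the same terms can probably be created from the SharePoint Management Shell. The following is only a minimal sketch: the site URL is a hypothetical placeholder, and I assume that the farm has a single term store and that none of the terms exist yet (CreateTerm fails on duplicates).

```powershell
# Sketch only: run in the SharePoint 2013 Management Shell on a farm server.
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

# Assumption: hypothetical URL of a site collection connected to the term store
$siteUrl = "http://sharepoint.example.com"

$session   = Get-SPTaxonomySession -Site $siteUrl
$termStore = $session.TermStores[0]                 # assumes a single term store
$group     = $termStore.Groups["Search Dictionaries"]
$lcid      = $termStore.DefaultLanguage             # default language of the term store

# Company names to extract in addition to the prepopulated dictionary
$inclusions = $group.TermSets["Company Inclusions"]
foreach ($name in "HanseSystems", "HanseVision") {
    [void]$inclusions.CreateTerm($name, $lcid)
}

# Company names that should not be extracted
$exclusions = $group.TermSets["Company Exclusions"]
[void]$exclusions.CreateTerm("SAP", $lcid)

# Write all pending changes back to the term store
$termStore.CommitAll()
```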

After you have defined the company inclusions and exclusions, you have to decide from which managed properties you want to extract the company names. In my scenario I wanted to extract the company names from the managed property “Body”. Therefore, I edited the managed property settings and enabled “Company Extraction” in the area “Company name extraction”.

After the configuration of the managed property it’s necessary to run a full crawl.
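
The full crawl can be started in Central Administration or from the SharePoint Management Shell. The following is a minimal sketch that assumes a single Search Service Application and the default content source name “Local SharePoint sites”; adjust both to your farm.

```powershell
# Sketch only: run in the SharePoint 2013 Management Shell on a farm server.
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

$searchApp = Get-SPEnterpriseSearchServiceApplication

# Assumption: the default content source name; replace it with your own
$contentSource = Get-SPEnterpriseSearchCrawlContentSource `
    -SearchApplication $searchApp `
    -Identity "Local SharePoint sites"

# Only start a full crawl when no other crawl is running on this content source
if ($contentSource.CrawlStatus -eq "Idle") {
    $contentSource.StartFullCrawl()
}

# Show the current state; rerun this line to follow the progress
$contentSource.CrawlStatus
```

The extracted values only show up once the crawl has finished, so wait until the status is back to “Idle” before you check the refiners.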

The extracted company names are copied to the managed property “companies”, which can now be used to refine search results. To add the managed property to the refinement panel on a search results page, switch the page to edit mode and open the settings of the Refinement Web Part. There you add “companies” to the “Selected refiners” section and move it to the position where you want it to appear.

Afterwards you can define the display name for the presentation in the refinement panel as well as the display template and some other settings like the sorting.

After saving and publishing the results page, it’s possible to refine the results based on the company names in the body of the documents.

Of course it’s also possible to build search queries with the help of the managed property “companies”.
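
As an illustration, a query such as companies:"HanseVision" restricts the results to documents in which that company name was extracted. The sketch below runs such a query from the SharePoint Management Shell. The site URL is a hypothetical placeholder, and I assume that the search assemblies are already loaded in the shell, which is normally the case in the Management Shell.

```powershell
# Sketch only: run in the SharePoint 2013 Management Shell on a farm server.
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

# Assumption: hypothetical URL of a site collection that uses the search service
$site = Get-SPSite "http://sharepoint.example.com"

# KQL property restriction on the managed property "companies"
$query = New-Object Microsoft.Office.Server.Search.Query.KeywordQuery($site)
$query.QueryText = 'companies:"HanseVision"'
$query.RowLimit  = 10

$executor     = New-Object Microsoft.Office.Server.Search.Query.SearchExecutor
$resultTables = $executor.ExecuteQuery($query)

# Pick the table with the relevant results and list title and URL of each hit
$relevant = $resultTables.Filter("TableType", "RelevantResults") | Select-Object -First 1
$relevant.Table.Rows | Select-Object Title, Path

$site.Dispose()
```

The same pattern should work for any other managed property that is filled by an extractor; only the property name in the query text changes.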

TechNet Article – Manage company name extraction in SharePoint Server 2013
https://technet.microsoft.com/en-us/library/jj591605.aspx

Custom entity extraction

Custom entity extraction offers the possibility to extract any desired entity from content, for example the body or the title of a document.

As you can extract any desired entity, you have to define every entity that you want to extract. These entities have to be saved in a .csv file that gets imported into SharePoint later on. The .csv needs the columns “Key” and “Display form”. In the “Key” column you define the entity that you want to extract. The “Display form” column is optional; there you can define how the entity will be displayed in the refiner. For example, when you want to extract “SharePoint 2013” but display “SharePoint” in the refinement, the .csv needs to look like this:

Key,Display form
SharePoint 2013,SharePoint

I want to extract the internal names of some of our products, so my .csv looks like this:

Key,Display form
SmartFind,SmartFind
SmartMeeting,SmartMeeting
SmartNavigation,SmartNavigation

Please make sure that there are no leading or trailing spaces around the terms.
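
Stray spaces are easy to miss in a text editor, so it can help to check the file before importing it. The following is a minimal sketch of such a check. The function name Test-DictionaryCsv is my own invention, and the check assumes a simple file without quoted fields or commas inside the terms.

```powershell
# Hypothetical helper: reports header problems and whitespace around terms.
function Test-DictionaryCsv {
    param([Parameter(Mandatory = $true)][string]$Path)

    $lines    = @(Get-Content -Path $Path)
    $problems = @()

    # The first line has to be the header with both column names
    if ($lines.Count -eq 0 -or $lines[0] -cne 'Key,Display form') {
        $problems += "Line 1: expected the header 'Key,Display form'"
    }

    for ($i = 1; $i -lt $lines.Count; $i++) {
        # Split into at most two fields: Key and Display form
        $fields = $lines[$i] -split ',', 2

        foreach ($field in $fields) {
            # A field that changes length when trimmed has leading or trailing whitespace
            if ($field.Length -ne $field.Trim().Length) {
                $problems += "Line $($i + 1): whitespace around '$field'"
            }
        }

        if ($fields[0].Trim().Length -eq 0) {
            $problems += "Line $($i + 1): the Key column is empty"
        }
    }

    return $problems
}
```

Calling @(Test-DictionaryCsv -Path .\CEEDictionary.csv) returns an empty list when the file is clean, and one message per problem otherwise.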

After finishing the .csv file, you have to decide which type of custom entity extraction dictionary you want to use. The type of extraction dictionary defines how entries are matched with content in the search index and which managed property will contain the extracted entities. An overview of the different types with all needed information can be found in this TechNet article: https://technet.microsoft.com/en-us/library/jj219480.aspx

Now the .csv file needs to be imported into the SharePoint Search Service Application. This can be done with the help of the SharePoint Management Shell and the following command:

# Get the Search Service Application of the farm
$searchApp = Get-SPEnterpriseSearchServiceApplication
# Import the .csv file as the first "Word Extraction" dictionary
Import-SPEnterpriseSearchCustomExtractionDictionary -SearchApplication $searchApp -FileName \\JVRC-SP2013\Installation\CEEDictionary.csv -DictionaryName Microsoft.UserDictionaries.EntityExtraction.Custom.Word.1

The parameter “-DictionaryName” defines the type and the number of the dictionary. In my case I’m using a “Word Extraction” dictionary, and as it’s my first dictionary, the parameter value ends with “1”. “Word Extraction” is case-insensitive and only matches the exact entity as a whole word: the entry “anchor” matches “anchor” and “Anchor,” but not “anchorage”.
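
To get a feeling for this matching behaviour before you run a crawl, you can approximate it with a regular expression. This is only a rough emulation and not the logic SharePoint uses internally: the function name Test-WordMatch is hypothetical, and the search engine’s own word breaking may treat some characters differently.

```powershell
# Hypothetical helper: approximates case-insensitive whole-word matching.
function Test-WordMatch {
    param(
        [Parameter(Mandatory = $true)][string]$Entry,
        [Parameter(Mandatory = $true)][string]$Text
    )

    # The entry must not be preceded or followed by another word character
    $pattern = '(?<!\w)' + [regex]::Escape($Entry) + '(?!\w)'
    $options = [System.Text.RegularExpressions.RegexOptions]::IgnoreCase

    return [regex]::IsMatch($Text, $pattern, $options)
}

Test-WordMatch -Entry 'anchor' -Text 'anchor'      # True
Test-WordMatch -Entry 'anchor' -Text 'Anchor,'     # True
Test-WordMatch -Entry 'anchor' -Text 'anchorage'   # False
```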

After you have imported the .csv file and thereby defined the entity extraction dictionary, you have to decide from which managed properties you want to extract the defined entities.

In my scenario I wanted to extract the entities from the managed property “Body”. Therefore, I edited the managed property settings and enabled “Word Extraction – Custom 1” in the area “Custom entity extraction”.

Please be aware that you have to enable the entity extractors based on the type of dictionary and the number of the dictionary you created.

After the configuration of the managed property it’s necessary to run a full crawl.

The extracted entities are copied to the managed property “WordCustomRefiner1”, which can now be used to refine search results. To add the managed property to the refinement panel on a search results page, switch the page to edit mode and open the settings of the Refinement Web Part. There you add “WordCustomRefiner1” to the “Selected refiners” section and move it to the position where you want it to appear.

Afterwards you can define the display name for the presentation in the refinement panel as well as the display template and some other settings like the sorting. As I’m extracting our internal product names, I set the display name of the refiner to “Product”.

After saving and publishing the results page, it’s possible to refine the results based on the desired entities in the body of the documents, in my case our internal product names.

Of course it’s also possible to build search queries with the help of the managed property “WordCustomRefiner1”.

TechNet Article – Create and deploy custom entity extractors in SharePoint Server 2013
https://technet.microsoft.com/en-us/library/jj219480.aspx
