
Spell Check Lucene in AEM 5.6.1

I am trying to implement spell check for a content search application. The application relies heavily on JCR queries through Query Builder, and its users pay for the content.

They need smart search features, including a "Did you mean ...?" feature like Google's.

Adobe has invested a good amount of effort and money in creating and updating its knowledge documents, but they are still lacking for many modules that are not core to AEM.

The Salesforce connector is one such module; I believe it has never worked with the Unlimited edition of Salesforce. It is not only me: a couple of my friends promised customers out-of-the-box Salesforce integration using the Salesforce template. It works with the free Salesforce edition but never with the Unlimited edition. Adobe support and Adobe forum users are clueless, and people end up writing custom code against the Salesforce REST API or WSDL to talk to AEM. That's one story.

Coming back to the out-of-the-box spell check in AEM 5.6.1: I wanted to give search users a "Did you mean ...?" feature, so I enabled the spellchecker module in workspace.xml.

<SearchIndex class="com.day.crx.query.lucene.LuceneHandler">
    <param name="path" value="${wsp.home}/index"/>
    <param name="resultFetchSize" value="50"/>
    <param name="indexingConfiguration" value="${wsp.home}/indexing_config.xml"/>
    <param name="tikaConfigPath" value="${wsp.home}/tika-config.xml"/>
    <param name="supportHighlighting" value="false"/>
    <param name="spellCheckerClass" value="com.day.crx.core.query.spell.CRXSpellChecker$OneMinuteRefreshInterval"/>
</SearchIndex>

I ran the indexing, which took a couple of hours, and finally this directory was created: crx-quickstart\repository\workspaces\crx.default\index\spellchecker.

I wrote the query code like this, using the JCR QueryManager:

        String termNew = term; // fall back to the original term if there is no suggestion
        try {
            final QueryManager manager = session.getWorkspace().getQueryManager();
            Query query = manager.createQuery("/jcr:root[rep:spellcheck('" + term + "')]/(rep:spellcheck())", Query.XPATH);
            RowIterator rows = query.execute().getRows();
            // the above query will always return the root node no matter what string we check
            Row r = rows.nextRow();
            // get the result of the spell checking
            Value v = r.getValue("rep:spellcheck()");
            if (v != null) {
                termNew = v.getString();
            }
        } catch (Exception ex) {
            // pass the exception itself so the stack trace is logged
            log.error("error caught in getSpelledChecked", ex);
        }
        System.out.println(" Source >> " + term + " Suggestion>> " + termNew);

And here is the funny output. Some of the suggestions are good, but "learning" comes back as "earning" and "manger" comes back as "wagner" :). This dictionary is not usable at all.

[Screenshot: suggestion output]
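One fragile spot in the query code above is that the search term is concatenated straight into the XPath string literal, so a term containing an apostrophe would break the statement. A minimal sketch of a guard, assuming the usual XPath convention of doubling a single quote inside a single-quoted literal; the class and method names (SpellcheckQuery, escape, build) are hypothetical and not part of any AEM or Jackrabbit API:

```java
public class SpellcheckQuery {

    /** Doubles single quotes so the term stays inside the XPath string literal. */
    public static String escape(String term) {
        return term.replace("'", "''");
    }

    /** Builds the same rep:spellcheck statement as the snippet above, with the term escaped. */
    public static String build(String term) {
        return "/jcr:root[rep:spellcheck('" + escape(term) + "')]/(rep:spellcheck())";
    }

    public static void main(String[] args) {
        System.out.println(build("manger"));
        System.out.println(build("customer's acount"));
    }
}
```

The returned string can be passed to manager.createQuery(..., Query.XPATH) in place of the hand-built concatenation.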

Here is the most painful area.

Custom Spell Check Solution: "Did You Mean" in AEM

Since the out-of-the-box spell check in AEM does not seem usable, I decided to use the Lucene spell check API directly, rather than the CQ5 search API, which is a wrapper on top of Lucene.

Here are the details of the existing CQ5 search bundle.
Bundle-Name: Social UGC Search Collections. This bundle provides the out-of-the-box spell check capabilities, and lucene-core-3.6.1.jar is included in it as a supporting jar.

Here is the Custom Code

package com.xyz.util;

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.spell.PlainTextDictionary;
import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Dictionary {
    public static void main(String[] args) throws Exception {
        // Directory where the spell-check index is stored
        File dir = new File("B:/Projects/download/dic");
        Directory directory = FSDirectory.open(dir);
        SpellChecker spellChecker = new SpellChecker(directory);
        try {
            // Index the plain-text dictionary (one entry per line);
            // the last argument controls whether the index is fully merged afterwards
            spellChecker.indexDictionary(
                    new PlainTextDictionary(new File("B:/Projects/dictionary/fulldictionary00.txt")),
                    new IndexWriterConfig(Version.LUCENE_CURRENT, new StandardAnalyzer(Version.LUCENE_CURRENT)),
                    false);
            String wordForSuggestions = "mv money";
            int suggestionsNumber = 1;
            String[] suggestions = spellChecker.suggestSimilar(wordForSuggestions, suggestionsNumber);
            if (suggestions != null && suggestions.length > 0) {
                for (String word : suggestions) {
                    System.out.println("Did you mean:" + word);
                }
            } else {
                System.out.println("No suggestions found for word:" + wordForSuggestions);
            }
        } finally {
            // Release the index searcher and the underlying directory
            spellChecker.close();
            directory.close();
        }
    }
}

 

Dictionary sample

marketing on demand
compliancemax
enhanced trading
learning center
move money
compliance max

When you deploy the above code in an AEM OSGi bundle, you will find that packages such as org.apache.lucene.analysis.standard are not resolved. To resolve that, you need to make the following change to the maven-bundle-plugin configuration in your Maven build:
<plugin>
    <groupId>org.apache.felix</groupId>
    <artifactId>maven-bundle-plugin</artifactId>
    <extensions>true</extensions>
    <configuration>
        <instructions>
            <Bundle-Category>xyz</Bundle-Category>
            <Import-Package>
                *
            </Import-Package>
            <Export-Package>
                com.xyz.*,org.apache.lucene.*,org.tartarus.snowball.*
            </Export-Package>
        </instructions>
    </configuration>
</plugin>

Now if you search for "mv money", the suggestion comes back as "move money".
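To see why "mv money" resolves to "move money" with the sample dictionary, it helps to look at edit distance. Lucene's SpellChecker ranks its candidates with a string distance (Levenshtein by default). The sketch below is a simplified standalone illustration of that ranking idea only, not Lucene's implementation: it skips the n-gram candidate lookup and the accuracy threshold, and the class and method names (ClosestMatch, distance, closest) are made up for this example.

```java
import java.util.Arrays;
import java.util.List;

public class ClosestMatch {

    /** Classic dynamic-programming Levenshtein edit distance. */
    public static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Returns the dictionary entry with the smallest edit distance to the input. */
    public static String closest(String input, List<String> dictionary) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String entry : dictionary) {
            int d = distance(input, entry);
            if (d < bestDistance) {
                bestDistance = d;
                best = entry;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> dictionary = Arrays.asList(
                "marketing on demand", "compliancemax", "enhanced trading",
                "learning center", "move money", "compliance max");
        System.out.println("Did you mean: " + closest("mv money", dictionary));
    }
}
```

"mv money" is two insertions away from "move money" and much further from every other entry. The same measure puts "earning" just one edit away from "learning", which is why a dictionary full of unrelated words can produce such suggestions.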

Hope this helps.

 

 

 


JCR Content Model: Some Key Rules

Data Store

The data store holds large binaries. On write, these are streamed directly to the data store, and only an identifier referencing the binary is written to the Persistence Manager (PM) store. By providing this level of indirection, the data store ensures that large binaries are stored only once, even if they appear in multiple locations within the content in the PM store. In effect, the data store is an implementation detail of the PM store. Like the PM, the data store can be configured to store its data in a file system (the default) or in a database. The minimum object length defaults to 100 bytes; smaller objects are stored inline (not in the data store). The maximum value for this setting is 32000, because Java does not support strings longer than 64 KB in writeUTF.
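The "stored only once" behaviour can be pictured as a content-addressed store: the identifier is derived from the bytes themselves, so identical binaries map to the same record. The following is a minimal in-memory sketch of that idea, not CRX code. The class name TinyDataStore, the "inline" marker, and the use of SHA-1 as the hash are assumptions made for illustration; the 100-byte threshold comes from the default mentioned above.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

public class TinyDataStore {

    /** Objects shorter than this stay inline (default taken from the text above). */
    static final int MIN_RECORD_LENGTH = 100;

    private final Map<String, byte[]> records = new HashMap<>();

    /**
     * Returns what the persistence manager would keep for this binary:
     * the marker "inline" for small objects, otherwise a content-derived identifier.
     */
    public String store(byte[] content) {
        if (content.length < MIN_RECORD_LENGTH) {
            return "inline";
        }
        String id = sha1Hex(content);
        records.putIfAbsent(id, content.clone()); // identical content is stored only once
        return id;
    }

    public int recordCount() {
        return records.size();
    }

    /** Hex-encoded SHA-1 of the content (illustrative choice of hash). */
    static String sha1Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        TinyDataStore store = new TinyDataStore();
        byte[] pdf = new byte[5000]; // stand-in for a large binary
        String first = store.store(pdf);
        String second = store.store(pdf.clone()); // same bytes attached at another node
        System.out.println(first.equals(second) + ", records=" + store.recordCount());
    }
}
```

Two nodes referencing the same large binary end up holding the same identifier, while the bytes themselves exist once in the store.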

Cluster Journal

Whenever CRX writes data it first records the intended change in the journal. Maintaining the journal helps ensure data consistency and helps the system to recover quickly from crashes. As with the PM and data stores, the journal can be stored in a file system (the default) or in a database.
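The "record first, then apply" idea described here is the general write-ahead journaling pattern. Below is a toy sketch of it, based only on the description above and not on how CRX actually stores its journal; the class TinyJournal and its methods are hypothetical, and an in-memory list stands in for the durable journal file or database table.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TinyJournal {

    // Stand-in for the durable journal (a file or database table in a real system)
    private final List<String[]> journal = new ArrayList<>();

    // Stand-in for the repository state that a crash could leave incomplete
    private final Map<String, String> state = new HashMap<>();

    /** Records the intended change in the journal first, then applies it. */
    public void write(String path, String value) {
        journal.add(new String[] {path, value});
        state.put(path, value);
    }

    /** Simulates a crash that loses the applied state but not the journal. */
    public void crash() {
        state.clear();
    }

    /** Replays the journal in order to rebuild a consistent state. */
    public void recover() {
        for (String[] entry : journal) {
            state.put(entry[0], entry[1]);
        }
    }

    public String read(String path) {
        return state.get(path);
    }

    public static void main(String[] args) {
        TinyJournal repo = new TinyJournal();
        repo.write("/content/a", "v1");
        repo.write("/content/a", "v2");
        repo.crash();
        repo.recover();
        System.out.println(repo.read("/content/a"));
    }
}
```

Because entries are replayed in the order they were recorded, the later write to the same path wins after recovery.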

Persistence Manager
Each workspace in the repository can be separately configured to store its data through a specific persistence manager (the class that manages the reading and writing of the data). Similarly, the repository-wide version store can also be independently configured to use a particular persistence manager. A number of different persistence managers are available, capable of storing data in a variety of file formats or relational databases.

Query Index

CRX's inverted index is based on Apache Lucene. Its key characteristics:

Most index updates are synchronous.
Long full-text extraction tasks are handled in the background.
Other cluster nodes update their indexes at the next cluster sync.
Everything is indexed by default.
You can tweak the indexing configuration to improve indexing functionality, performance, and disk usage.
There is one index per workspace (and one for the shared version store).
Indexes are not shared across a cluster; they are local to each cluster node.

Jackrabbit
The Apache Jackrabbit™ content repository is a fully conforming implementation of the Content Repository for Java Technology API (JCR), specified in JSR 170 and JSR 283. The next release of the JCR specification, JSR 333, is currently in progress.