I came across a problem in a CQ 5.6.1 installation with a 50 GB content repository, mostly cq:Page nodes and PDF documents, on the publish side. A custom search API had been written using the CQ QueryBuilder API and XPath queries. The query takes one or more search keywords as its search terms.
Lucene fuzzy search was used to bring back similar results. It returns a good-sized result set, but the fuzzy matching also leads to high response times.
The publish instances are TAR installations, and Lucene (not Solr or any external search API) is used to index binaries, so the Lucene index is a couple of GB in size. Because of the PDFs and other binary content, the response time of fuzzy XPath queries is very high and the results are very ambiguous, even though boosting has been applied to certain properties.
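The actual query is not shown in this post, so the following is only a rough sketch of what a fuzzy full-text XPath query of this kind might look like. The helper name `build_fuzzy_xpath`, the search root `/content`, and the restriction to `cq:Page` are assumptions made for illustration; the trailing `~` is the Lucene fuzzy operator inside `jcr:contains`.

```python
def build_fuzzy_xpath(keywords, root="/content", node_type="cq:Page"):
    """Return a JCR XPath query that fuzzy-matches every keyword (illustrative only)."""
    terms = []
    for keyword in keywords:
        # Drop quotes and any existing fuzzy operator so the expression stays well-formed
        for word in keyword.replace("'", " ").replace('"', " ").replace("~", " ").split():
            terms.append(word + "~")  # trailing ~ requests a Lucene fuzzy match
    return "/jcr:root{0}//element(*, {1})[jcr:contains(jcr:content, '{2}')]".format(
        root, node_type, " ".join(terms)
    )


print(build_fuzzy_xpath(["adobe", "search api"]))
```

Every fuzzy term expands into many candidate terms in the index, which is one reason a large binary-heavy index makes such queries slow.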
It was decided not to index PDF and other binary content on the publish side. These are the steps that were taken.
1. Make changes in workspace.xml.
<SearchIndex class="com.day.crx.query.lucene.LuceneHandler">
<param name="path" value="${wsp.home}/index"/>
<param name="resultFetchSize" value="50"/>
<param name="indexingConfiguration" value="${wsp.home}/indexing_config.xml"/>
<param name="tikaConfigPath" value="${wsp.home}/tika-config.xml"/>
<param name="supportHighlighting" value="false"/>
<param name="spellCheckerClass" value="com.day.crx.core.query.spell.CRXSpellChecker$OneMinuteRefreshInterval"/>
</SearchIndex>
Here is indexing_config.xml:
<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://jackrabbit.apache.org/dtd/indexing-configuration-1.2.dtd">
<configuration
xmlns:cq="http://www.day.com/jcr/cq/1.0"
xmlns:dam="http://www.day.com/dam/1.0"
xmlns:nt="http://www.jcp.org/jcr/nt/1.0"
xmlns:jcr="http://www.jcp.org/jcr/1.0"
xmlns:sling="http://sling.apache.org/jcr/sling/1.0">
<!-- Do not index content of subassets -->
<index-rule nodeType="nt:resource"
condition="ancestor::subassets/@jcr:primaryType='{http://www.jcp.org/jcr/nt/1.0}unstructured'">
</index-rule>
<index-rule nodeType="cq:PageContent">
<property boost="5.0">jcr:title</property>
<property boost="3.0">Keywords</property>
<property boost="3.0">jcr:description</property>
</index-rule>
<!--
Exclude some well known properties from the node scope
fulltext index. Do not add rules below this one, since
this rule matches any node and acts as a default/fallback.
-->
<index-rule nodeType="nt:base">
<property nodeScopeIndex="false">analyticsProvider</property>
<property nodeScopeIndex="false">analyticsSnippet</property>
<property nodeScopeIndex="false">hideInNav</property>
<property nodeScopeIndex="false">offTime</property>
<property nodeScopeIndex="false">onTime</property>
<property nodeScopeIndex="false">cq:allowedTemplates</property>
<property nodeScopeIndex="false">cq:childrenOrder</property>
<property nodeScopeIndex="false">cq:cugEnabled</property>
<property nodeScopeIndex="false">cq:cugPrincipals</property>
<property nodeScopeIndex="false">cq:cugRealm</property>
<property nodeScopeIndex="false">cq:designPath</property>
<property nodeScopeIndex="false">cq:isCancelledForChildren</property>
<property nodeScopeIndex="false">cq:isDeep</property>
<property nodeScopeIndex="false">cq:lastModified</property>
<property nodeScopeIndex="false">cq:lastModifiedBy</property>
<property nodeScopeIndex="false">cq:lastPublished</property>
<property nodeScopeIndex="false">cq:lastPublishedBy</property>
<property nodeScopeIndex="false">cq:lastReplicated</property>
<property nodeScopeIndex="false">cq:lastReplicatedBy</property>
<property nodeScopeIndex="false">cq:lastReplicationAction</property>
<property nodeScopeIndex="false">cq:lastReplicationStatus</property>
<property nodeScopeIndex="false">cq:lastRolledout</property>
<property nodeScopeIndex="false">cq:lastRolledoutBy</property>
<property nodeScopeIndex="false">cq:name</property>
<property nodeScopeIndex="false">cq:parentPath</property>
<property nodeScopeIndex="false">cq:segments</property>
<property nodeScopeIndex="false">cq:siblingOrder</property>
<property nodeScopeIndex="false">cq:template</property>
<property nodeScopeIndex="false">cq:trigger</property>
<property nodeScopeIndex="false">cq:versionComment</property>
<property nodeScopeIndex="false">jcr:createdBy</property>
<property nodeScopeIndex="false">jcr:lastModifiedBy</property>
<property nodeScopeIndex="false">sling:alias</property>
<property nodeScopeIndex="false">sling:resourceType</property>
<property nodeScopeIndex="false">sling:vanityPath</property>
<property isRegexp="true">.*:.*</property>
</index-rule>
<!-- Cq Page for jcr:contains(jcr:content, "...") searches -->
<aggregate primaryType="cq:PageContent">
<include>*</include>
<include>*/*</include>
<include>*/*/*</include>
<!-- <include>*/*/*/*</include> -->
</aggregate>
<aggregate primaryType="dam:Asset">
<include>jcr:content</include>
<include>jcr:content/metadata</include>
<include>jcr:content/metadata/*</include>
<!-- <include>jcr:content/renditions</include>
<include>jcr:content/renditions/original</include>
<include>jcr:content/renditions/original/jcr:content</include>
<include>jcr:content/renditions/original/jcr:content/jcr:lastModified</include> -->
</aggregate>
<!-- nt:file child axis orderby index -->
<aggregate primaryType="nt:file">
<include>jcr:content</include>
<include>jcr:content/jcr:lastModified</include>
</aggregate>
<!-- cq:Page child axis orderby index -->
<aggregate primaryType="cq:Page">
<include>jcr:content</include>
<include>jcr:content/cq:lastModified</include>
</aggregate>
</configuration>
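The comment in the configuration warns that no rule should follow the `nt:base` rule, because it matches every node and acts as the fallback. A small check along these lines could catch a misplaced rule before a two-hour reindex. It is only a sketch: the function name `fallback_rule_is_last` and the trimmed-down sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET


def fallback_rule_is_last(config_xml):
    """Return True if the catch-all nt:base index-rule is the last index-rule."""
    root = ET.fromstring(config_xml)
    node_types = [rule.get("nodeType") for rule in root.findall("index-rule")]
    if "nt:base" not in node_types:
        return True  # no catch-all rule, so nothing can be shadowed by it
    return node_types.index("nt:base") == len(node_types) - 1


# Hypothetical, shortened configurations used only to exercise the check
GOOD = """<configuration>
<index-rule nodeType="cq:PageContent"/>
<index-rule nodeType="nt:base"/>
</configuration>"""

BAD = """<configuration>
<index-rule nodeType="nt:base"/>
<index-rule nodeType="cq:PageContent"/>
</configuration>"""

print(fallback_rule_is_last(GOOD), fallback_rule_is_last(BAD))
```

For a real file, the contents can be read in binary mode and passed in, for example `fallback_rule_is_last(open(path, "rb").read())`.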
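Here is the Tika configuration (tika-config.xml). PDF and the other binary MIME types are mapped to EmptyParser, so no text is extracted from them: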
<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<properties>
<mimeTypeRepository resource="/org/apache/tika/mime/tika-mimetypes.xml" magic="false"/>
<parsers>
<!--<parser name="parse-dcxml" class="org.apache.tika.parser.xml.DcXMLParser">
<mime>application/xml</mime>
<mime>image/svg+xml</mime>
</parser>-->
<parser name="parse-html" class="org.apache.tika.parser.html.HtmlParser">
<mime>text/html</mime>
<mime>application/xhtml+xml</mime>
<mime>application/x-asp</mime>
</parser>
<parser name="parse-txt" class="org.apache.tika.parser.txt.TXTParser">
<mime>text/plain</mime>
</parser>
<parser class="org.apache.tika.parser.DefaultParser"/>
<parser class="org.apache.tika.parser.EmptyParser">
<!-- Disable package extraction as it's too resource-intensive -->
<mime>application/x-archive</mime>
<mime>application/x-bzip</mime>
<mime>application/x-bzip2</mime>
<mime>application/x-cpio</mime>
<mime>application/x-gtar</mime>
<mime>application/x-gzip</mime>
<mime>application/x-tar</mime>
<mime>application/zip</mime>
<mime>application/x-tika-msoffice</mime>
<mime>application/msword</mime>
<mime>application/vnd.ms-excel</mime>
<mime>application/vnd.ms-excel.sheet.binary.macroenabled.12</mime>
<mime>application/vnd.ms-powerpoint</mime>
<mime>application/vnd.visio</mime>
<mime>application/vnd.ms-outlook</mime>
<mime>application/xml</mime>
<mime>application/pdf</mime>
<mime>application/x-tika-ooxml</mime>
<mime>application/vnd.openxmlformats-package.core-properties+xml</mime>
<!--
<mime>application/vnd.openxmlformats-officedocument.spreadsheetml.sheet</mime>
-->
<mime>application/vnd.openxmlformats-officedocument.spreadsheetml.template</mime>
<mime>application/vnd.ms-excel.sheet.macroenabled.12</mime>
<mime>application/vnd.ms-excel.template.macroenabled.12</mime>
<mime>application/vnd.ms-excel.addin.macroenabled.12</mime>
<mime>application/vnd.openxmlformats-officedocument.presentationml.presentation</mime>
<mime>application/vnd.openxmlformats-officedocument.presentationml.template</mime>
<mime>application/vnd.openxmlformats-officedocument.presentationml.slideshow</mime>
<mime>application/vnd.ms-powerpoint.presentation.macroenabled.12</mime>
<mime>application/vnd.ms-powerpoint.slideshow.macroenabled.12</mime>
<mime>application/vnd.ms-powerpoint.addin.macroenabled.12</mime>
<mime>application/vnd.openxmlformats-officedocument.wordprocessingml.document</mime>
<mime>application/vnd.openxmlformats-officedocument.wordprocessingml.template</mime>
<mime>application/vnd.ms-word.document.macroenabled.12</mime>
<mime>application/vnd.ms-word.template.macroenabled.12</mime>
<mime>application/java-vm</mime>
<mime>application/rtf</mime>
<mime>application/mbox</mime>
<mime>application/epub+zip</mime>
<mime>application/x-midi</mime>
<mime>application/vnd.sun.xml.writer</mime>
<mime>application/vnd.oasis.opendocument.text</mime>
<mime>application/vnd.oasis.opendocument.graphics</mime>
<mime>application/vnd.oasis.opendocument.presentation</mime>
<mime>application/vnd.oasis.opendocument.spreadsheet</mime>
<mime>application/vnd.oasis.opendocument.chart</mime>
<mime>application/vnd.oasis.opendocument.image</mime>
<mime>application/vnd.oasis.opendocument.formula</mime>
<mime>application/vnd.oasis.opendocument.text-master</mime>
<mime>application/vnd.oasis.opendocument.text-web</mime>
<mime>application/vnd.oasis.opendocument.text-template</mime>
<mime>application/vnd.oasis.opendocument.graphics-template</mime>
<mime>application/vnd.oasis.opendocument.presentation-template</mime>
<mime>application/vnd.oasis.opendocument.spreadsheet-template</mime>
<mime>application/vnd.oasis.opendocument.chart-template</mime>
<mime>application/vnd.oasis.opendocument.image-template</mime>
<mime>application/vnd.oasis.opendocument.formula-template</mime>
<mime>application/x-vnd.oasis.opendocument.text</mime>
<mime>application/x-vnd.oasis.opendocument.graphics</mime>
<mime>application/x-vnd.oasis.opendocument.presentation</mime>
<mime>application/x-vnd.oasis.opendocument.spreadsheet</mime>
<mime>application/x-vnd.oasis.opendocument.chart</mime>
<mime>application/x-vnd.oasis.opendocument.image</mime>
<mime>application/x-vnd.oasis.opendocument.formula</mime>
<mime>application/x-vnd.oasis.opendocument.text-master</mime>
<mime>application/x-vnd.oasis.opendocument.text-web</mime>
<mime>application/x-vnd.oasis.opendocument.text-template</mime>
<mime>application/x-vnd.oasis.opendocument.graphics-template</mime>
<mime>application/x-vnd.oasis.opendocument.presentation-template</mime>
<mime>application/x-vnd.oasis.opendocument.spreadsheet-template</mime>
<mime>application/x-vnd.oasis.opendocument.chart-template</mime>
<mime>application/x-vnd.oasis.opendocument.image-template</mime>
<mime>application/x-vnd.oasis.opendocument.formula-template</mime>
<!-- Disable image extraction as there's no text to be found -->
<mime>image/bmp</mime>
<mime>image/gif</mime>
<mime>image/jpeg</mime>
<mime>image/png</mime>
<mime>image/vnd.wap.wbmp</mime>
<mime>image/x-icon</mime>
<mime>image/x-psd</mime>
<mime>image/x-xcf</mime>
<mime>audio/mpeg</mime>
<mime>audio/basic</mime>
<mime>audio/x-wav</mime>
<mime>audio/x-aiff</mime>
<mime>audio/midi</mime>
</parser>
</parsers>
</properties>
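Since the whole change depends on `application/pdf` (and the other binary types) being routed to `EmptyParser`, it may be worth checking the file before reindexing. The snippet below is a sketch using only the Python standard library; the function name `disabled_mime_types` and the shortened sample are assumptions for illustration, not part of the original setup.

```python
import xml.etree.ElementTree as ET

EMPTY_PARSER = "org.apache.tika.parser.EmptyParser"


def disabled_mime_types(config_xml):
    """Collect the MIME types that a tika-config.xml maps to EmptyParser."""
    root = ET.fromstring(config_xml)
    disabled = set()
    for parser in root.iter("parser"):
        if parser.get("class") == EMPTY_PARSER:
            # Each <mime> child names one type whose content will not be extracted
            disabled.update(m.text.strip() for m in parser.findall("mime") if m.text)
    return disabled


# Hypothetical, shortened config used only to exercise the function
SAMPLE = """<properties>
<parsers>
<parser class="org.apache.tika.parser.DefaultParser"/>
<parser class="org.apache.tika.parser.EmptyParser">
<mime>application/pdf</mime>
<mime>application/zip</mime>
</parser>
</parsers>
</properties>"""

print("application/pdf" in disabled_mime_types(SAMPLE))
```

Commented-out `<mime>` entries are ignored by the XML parser, so a type that was disabled by commenting does not show up in the result.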
2. Delete the directories Aem 5.6\crx-quickstart\repository\workspaces\crx.default\index and Aem 5.6\crx-quickstart\repository\repository\index, then start CQ5 so that the index is rebuilt. For me, the 50 GB repository took 2 hours to reindex.
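The deletion can be scripted if it has to be repeated on several publish instances. This is a minimal sketch, assuming the instance is stopped and that the two index directories sit at the relative paths named in step 2; the function name `delete_index_dirs` and the `dry_run` flag are made up for illustration.

```python
import shutil
from pathlib import Path

# Relative to crx-quickstart, as named in step 2
INDEX_DIRS = (
    Path("repository/workspaces/crx.default/index"),
    Path("repository/repository/index"),
)


def delete_index_dirs(quickstart_dir, dry_run=True):
    """Remove the Lucene index directories; return the directories that were found."""
    found = []
    for rel in INDEX_DIRS:
        target = Path(quickstart_dir) / rel
        if target.is_dir():
            if not dry_run:
                shutil.rmtree(target)  # the index is rebuilt on the next start
            found.append(target)
    return found
```

Calling `delete_index_dirs(path)` first with the default `dry_run=True` only lists what would be removed, which is a cheap safeguard against pointing the script at the wrong directory.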
3. After the server was up, I went into the DAM and searched for some keywords from a PDF. No results came back, which means that binary indexing of PDFs had stopped, and the Lucene index size dropped from 3 GB to 500 MB.
This also improved the response time for search requests.
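To compare the index size before and after the change, the size of the index directory can be summed up with a few lines of Python. This is just a sketch; the function name `dir_size_bytes` is an assumption, and the path to pass in is the workspace index directory from step 2.

```python
import os


def dir_size_bytes(path):
    """Sum the sizes of all regular files below path (symlinks are skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total
```

For example, `dir_size_bytes(index_dir) / (1024 ** 2)` gives the size in MB, which can be recorded once before deleting the index and once after the reindex has finished.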