Apache Solr: how to search sorting accented fields

Wednesday, March 16, 2011
By Rafael Barbolo, Coop10.

We are using Apache Solr and Sunspot in a Brazilian Portuguese project and wanted search to work with accented chars.


Normalizing Latin chars (á, é, ç, …) to ASCII chars (a, e, c, …) in our search index and queries was pretty easy. We changed the text field definition in the schema to the following:



<fieldType name="text" class="solr.TextField" omitNorms="false">
	<analyzer>
		<charFilter class="solr.HTMLStripCharFilterFactory"/> <!-- strip HTML -->
		<tokenizer class="solr.StandardTokenizerFactory"/>
		<filter class="solr.StandardFilterFactory"/>
		<filter class="solr.LowerCaseFilterFactory"/>
		<filter class="solr.ASCIIFoldingFilterFactory"/> <!-- convert accented chars to ASCII -->
	</analyzer>
</fieldType>
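
To make the effect of the last two filters concrete, here is a minimal Python sketch of lowercasing followed by accent folding. It is only an approximation written for illustration: `fold_to_ascii` is a hypothetical helper, not part of Solr, and Solr's ASCIIFoldingFilter uses its own mapping table that covers more characters than the Unicode decomposition used here.

```python
import unicodedata

def fold_to_ascii(text: str) -> str:
    """Lowercase the text and drop diacritics (rough stand-in for the filters)."""
    # NFKD splits "ç" into "c" plus a combining cedilla, "á" into "a" plus an acute, etc.
    decomposed = unicodedata.normalize("NFKD", text.lower())
    # Keep only the base characters, discarding the combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# The same chain runs at index time and at query time,
# so an unaccented query term equals the accented indexed term.
indexed_term = fold_to_ascii("Lançamento")
query_term = fold_to_ascii("lancamento")
```

Because both sides are folded the same way, a search for "lancamento" and a search for "lançamento" end up looking for the same term.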



A more difficult problem was sorting accented string fields. By default, the class of Sunspot’s string field is solr.StrField. The string field is used for sorting, but it was ordering accented inputs incorrectly. For the inputs “árvore”, “bola”, “ano” it returned “ano”, “bola”, “árvore” (the correct result would be “ano”, “árvore”, “bola”).

The problem with sorting accented chars is that solr.StrField sorts on the raw, unanalyzed value. In our data non-ASCII chars were represented as HTML entities (for example, &aacute; instead of á), and neither an entity nor a raw accented char sorts next to its plain ASCII letter.
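
The ordering reported above can be reproduced outside Solr. The sketch below is plain Python, not Solr code, and assumes the values reach the comparison as raw accented characters. It sorts the three inputs by Unicode code point, which is how an unanalyzed string field orders its values.

```python
words = ["árvore", "bola", "ano"]

# Python compares strings code point by code point.
# "á" is U+00E1, which is greater than "z" (U+007A),
# so any word starting with "á" lands after every unaccented word.
by_code_point = sorted(words)

print(by_code_point)
```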

To solve this, we changed the string field’s class to solr.TextField, while making sure that its tokenizer would not create more than one token per entry. The tokenizer we used was the KeywordTokenizer, which emits the whole input as a single token. The final schema for the string field was:



<fieldType name="string" class="solr.TextField" omitNorms="true">
	<analyzer>
		<tokenizer class="solr.KeywordTokenizerFactory"/>
		<filter class="solr.LowerCaseFilterFactory"/>
		<filter class="solr.ASCIIFoldingFilterFactory"/>
	</analyzer>
</fieldType>
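
As a rough illustration of what this field definition does for sorting, the Python sketch below treats each entry as a single token, lowercases it, folds the accents, and sorts on the result. `sort_key` is a hypothetical helper and only approximates the analyzer chain; Solr's ASCIIFoldingFilter uses its own, broader mapping table.

```python
import unicodedata

def sort_key(value: str) -> str:
    """One token per entry, lowercased, with diacritics removed."""
    # The whole value is kept as one unit, mirroring the KeywordTokenizer.
    decomposed = unicodedata.normalize("NFKD", value.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

entries = ["árvore", "bola", "ano"]

# Sorting on the folded key puts "árvore" between "ano" and "bola".
ordered = sorted(entries, key=sort_key)

print(ordered)
```

Note that the key for “árvore” is the folded form "arvore", not the original. In Solr the indexed term of such a field is likewise the folded value.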




Rafael Barbolo

Computer Engineer and administrator of Bit a Bit. Entrepreneur since 2007. Partner and co-founder of Kauplus, an e-commerce platform that offers innovative online shopping experiences.


6 Comments on “Apache Solr: how to search sorting accented fields”

  1. Eduardo Russo

    Damn, Barbs… in English?

  2. Eduardo Russo

    This Bit a Bit is looking really fancy!!!

  3. Ernane

    Hello,

    I made this change, but now my facets are losing their accents: “lançamento” becomes lancamento. Search works perfectly, but the facets come out like this. Do you know how to fix it?

    Thanks in advance.

