IdsExportPlugin
A plugin (more precisely, a set of plugins) for Solr that allows time-efficient export of the Ids of all found documents (or the values of any DocValues-enabled field) in comma-separated format, without sorting. Skipping the sort gives significantly better performance than Solr's built-in /export endpoint.
Note: the plugin is developed and tested on a standalone Solr instance; no promises or guarantees are made about Solr Cloud.
Requirements
- Solr version > 7.2 (tested with 7.2.1)
- Solr running in standalone mode (Solr Cloud not supported)
Motivation
The initial motivation for creating this plugin was the ability to produce output that could be used as direct input for the Terms Query Parser in another Solr request. Example:
First, search the Car Brands index for the IDs of all brands that sell in Poland:
http://localhost:8080/car_brands/select?q=availability:pl&fq={!ids field=brand_id}&wt=ids
Output: vw,opel,audi
Then, search the Car Models index for models with an electric engine from those brands:
http://localhost:8080/car_models/select?q=engine:electric&fq={!terms f=brand_id}vw,opel,audi
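The two-step flow above can be sketched on the client side as follows. This is a minimal illustration, not part of the plugin: the helper names `ids_export_url` and `terms_filter_url` are hypothetical, the main query is assumed to go through the standard `q` parameter, and no request is actually sent (the body of the first response is hard-coded from the example).

```python
from urllib.parse import urlencode

SOLR = "http://localhost:8080"  # base URL taken from the example above


def ids_export_url(core, query, field):
    """Build a /select URL that exports the values of `field` via the plugin."""
    params = {"q": query, "fq": "{!ids field=%s}" % field, "wt": "ids"}
    return "%s/%s/select?%s" % (SOLR, core, urlencode(params))


def terms_filter_url(core, query, field, exported_ids):
    """Build a /select URL that filters on exported values with the Terms Query Parser."""
    params = {"q": query, "fq": "{!terms f=%s}%s" % (field, exported_ids)}
    return "%s/%s/select?%s" % (SOLR, core, urlencode(params))


first = ids_export_url("car_brands", "availability:pl", "brand_id")
# Stand-in for the body returned by the first request (value from the example).
brand_ids = "vw,opel,audi"
second = terms_filter_url("car_models", "engine:electric", "brand_id", brand_ids)
print(first)
print(second)
```

The point of the comma-separated format is visible in the second call: the first response body is passed on unchanged, with no parsing step in between.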
Other possible use cases include:
- simplifying batch jobs that do some calculation over a full result set and don't require any document order (e.g. recalculating popularity for all products from Poland every day); this removes the need for paging
- creating reports - finding all documents matching criteria
- replacing the /export endpoint when sorting is not required
Basic concepts
IdsExportPlugin consists of:
- IdsExportFilter
- IdsExportSearchComponent
- IdsExportResponseWriter
The idea behind IdsExportPlugin is to use a post-filter (IdsExportFilter) as the last filter during the request processing phase, which will collect all found Document Ids in an optimized data structure. Then IdsExportSearchComponent will write those Ids to the response, and IdsExportResponseWriter will output them in comma-separated format.
IdsExportFilter
IdsExportFilter is a Solr post-filter. In Solr terminology, a filter is a piece of code that decides whether a document matches the search criteria and should be included in the response. A post-filter is executed after the regular filters, so it works on a limited set of documents that the previous filters have already narrowed down.
IdsExportFilter implements the post-filter interface, but it doesn't actually decide whether a document matches the search criteria: it accepts all documents. Instead, it collects the values of a certain field from the documents and stores them in a data structure. The field name is defined in the request URL or in the configuration.
This filter was initially designed to read the values of the documents' unique key, but in fact it can read the values of any field that has DocValues enabled. In this document we will refer to those values as Ids.
Internally, Ids are stored in a data structure:
- in case of fields with Numeric or Sorted Numeric DocValues, Ids (which are longs) are stored inside com.carrotsearch.hppc.LongArrayList (a data structure based on an array of primitive longs)
- in case of fields with Binary, Sorted or SortedSet DocValues, Ids (which are Strings) are stored as an ArrayList of org.apache.lucene.util.BytesRef (a Lucene-optimized type for binary values, mainly used for Strings)
Ids don't need to be unique: a repeated value is stored once for each occurrence.
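The collecting behaviour described above can be modelled with a small toy example. It is only a conceptual sketch in Python, not the plugin's Java code: the class name `IdCollector` is hypothetical, `array("q")` merely stands in for a list of primitive longs, and a plain list of `bytes` stands in for the list of BytesRef values.

```python
from array import array


class IdCollector:
    """Toy model of a pass-through collector: keeps every value it sees."""

    def __init__(self, numeric):
        # array("q") holds signed 64-bit integers, a rough analogue of a
        # primitive long list; a plain list of bytes stands in for BytesRef.
        self.values = array("q") if numeric else []

    def collect(self, value):
        self.values.append(value)  # no de-duplication, no sorting
        return True                # the document is always accepted


numeric_ids = IdCollector(numeric=True)
for doc_value in (42, 7, 42):
    numeric_ids.collect(doc_value)

string_ids = IdCollector(numeric=False)
for doc_value in (b"vw", b"opel", b"vw"):
    string_ids.collect(doc_value)

print(list(numeric_ids.values))  # [42, 7, 42]
print(string_ids.values)         # [b'vw', b'opel', b'vw']
```

Both collectors keep the repeated value twice and preserve arrival order, which mirrors the two properties stated above: no uniqueness and no sorting.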
IdsExportSearchComponent
IdsExportSearchComponent is a search component (a piece of code that executes after request processing, but before the response is sent) which simply adds the collected Ids to the Solr response under the key defined in the configuration. After this operation, the response contains an additional list with the Ids of all found documents.
IdsExportResponseWriter
The last component, IdsExportResponseWriter, transforms the Solr response into a comma-separated list of Ids. All additional response elements are skipped. The MIME type of the response is set to text/plain and the encoding to UTF-8.
Note: usage of IdsExportResponseWriter is optional. If you don't want a comma-separated format and you're fine with standard Solr JSON/XML/etc. response - then you don't have to use IdsExportResponseWriter.
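As a rough illustration of what the response writer does with the response, the sketch below renders only the values found under one key and drops everything else. The function name `write_ids` is hypothetical and the dictionary is a made-up stand-in for a Solr response; the `response_key` and `separator` arguments mirror the options described in the Configuration section.

```python
def write_ids(response, response_key="ids", separator=","):
    """Render only the values stored under `response_key`, joined by `separator`."""
    return separator.join(str(value) for value in response.get(response_key, []))


solr_response = {
    "responseHeader": {"status": 0},
    "response": {"numFound": 3, "docs": []},
    "ids": [10, 20, 30],
}
print(write_ids(solr_response))                 # 10,20,30
print(write_ids(solr_response, separator=";"))  # 10;20;30
```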
Installation
- Add the JAR file to Solr's classpath (https://lucene.apache.org/solr/guide/7_2/lib-directives-in-solrconfig.html)
- Add the following code to solrconfig.xml:
<queryParser name="ids" class="pl.allegro.search.solr.ids.filter.IdsExportFilterParserPlugin">
  <int name="bufferInitialSize">100000</int>
</queryParser>
<searchComponent name="ids" class="pl.allegro.search.solr.ids.searchcomponent.IdsExportSearchComponent">
  <str name="responseKey">ids</str>
</searchComponent>
<queryResponseWriter name="ids" class="pl.allegro.search.solr.ids.responsewriter.IdsExportResponseWriter">
  <str name="responseKey">ids</str>
</queryResponseWriter>
The exact meaning of the configuration parameters is described in the Configuration section.
Each of those components may be registered under any valid name.
- The name of the IdsExportFilterParserPlugin (which is a factory for IdsExportFilter) will be reflected in the Solr URL (you will use it in requests to activate the plugin), so please give it a reasonable name. In this document we will assume the name ids:
  http://localhost:8080/solr/core_name/select?q=*:*&fq={!ids field=product_id}
- We strongly recommend giving IdsExportSearchComponent the same name as IdsExportFilterParserPlugin, for simplicity.
- The name of the IdsExportResponseWriter will be reflected in the Solr URL (you will use it to change the output format), so please give it a reasonable name. We recommend the same name as IdsExportFilterParserPlugin, for simplicity. In this document we will assume the name ids:
  http://localhost:8080/solr/core_name/select?q=*:*&fq={!ids field=product_id}&wt=ids
Usage examples
http://localhost:8080/solr/core_name/select?q=*:*&fq={!ids field=product_id}&wt=ids
# Will output comma-separated values of the `product_id` field from all documents in the index.
Example response:
1,2,3,4,5,6
http://localhost:8080/solr/core_name/select?q=*:*&fq={!ids field=product_id}&rows=2
# Will output a list of values of the `product_id` field as an additional attribute of the Solr response.
Example response:
{
  "responseHeader": {
    "status": 0,
    "QTime": 2,
    "params": {
      "q": "*:*",
      "fq": "{!ids field=product_id}",
      "rows": "2"
    }
  },
  "response": {
    "numFound": 6,
    "start": 0,
    "docs": [
      {
        "product_name": "Test 0",
        "product_id": "0"
      },
      {
        "product_name": "Test 1",
        "product_id": "1"
      }
    ]
  },
  "ids": [
    "0",
    "1",
    "2",
    "3",
    "4",
    "5",
    "6"
  ]
}
# Note: ids ignores the rows/start parameters and always contains everything found.
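Both response shapes shown above are easy to consume on the client side. The sketch below is one possible way to do it with the Python standard library; the function names `parse_plain` and `parse_json` are hypothetical, the JSON body is an abbreviated made-up sample, and treating an empty body as "no Ids" is an assumption rather than documented plugin behaviour.

```python
import json


def parse_plain(body, separator=","):
    """Split a wt=ids response body into a list of string values."""
    body = body.strip()
    # Assumption: an empty body means that nothing was found.
    return body.split(separator) if body else []


def parse_json(body, response_key="ids"):
    """Return (page_of_docs, all_exported_ids) from a JSON response body."""
    data = json.loads(body)
    return data["response"]["docs"], data.get(response_key, [])


plain_ids = parse_plain("1,2,3,4,5,6")
docs, all_ids = parse_json(
    '{"response": {"start": 0, "docs": [{"product_id": "0"}, {"product_id": "1"}]},'
    ' "ids": ["0", "1", "2", "3"]}'
)
print(plain_ids)                # ['1', '2', '3', '4', '5', '6']
print(len(docs), len(all_ids))  # 2 4
```

The second call shows the effect of the note above: the page of documents is limited by rows, while the exported list is not.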
Configuration
IdsExportFilterParserPlugin configuration options available in solrconfig.xml:
- bufferInitialSize - initial size (in number of items) of the buffer for storing Ids. It should be a bit bigger than the estimated average response size. Generally any number will work, however:
  - if set too low, the buffer will be extended several times during request processing, resulting in increased CPU and memory consumption
  - if set too high, you will unnecessarily allocate a lot of memory
  Default value: 100 000.
- defaultIndexField - name of the field where Ids are stored. This can also be configured on a per-request basis via the URL parameter field; if the URL parameter is missing, the default configured here is used. Default value: doc_id.
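To get a feeling for the bufferInitialSize trade-off, the sketch below counts how often a growable array would have to be reallocated for a given result size. It is a simplified model, not the plugin's actual buffer code: the function name `count_resizes` is hypothetical and the growth factor of 1.5 is an assumption chosen for illustration.

```python
def count_resizes(initial_size, result_size, growth_factor=1.5):
    """Count reallocations of a growable array and report its final capacity.

    growth_factor is an assumption for illustration, not a value read from
    the plugin or its libraries.
    """
    capacity, resizes = max(1, initial_size), 0
    while capacity < result_size:
        capacity = int(capacity * growth_factor) + 1
        resizes += 1
    return resizes, capacity


for initial in (1_000, 100_000, 300_000):
    resizes, capacity = count_resizes(initial, 225_816)
    print(f"initial={initial:>7}: {resizes:>2} resizes, final capacity {capacity}")
```

Under these assumptions, a result set of 225 816 Ids needs 14 reallocations with an initial size of 1 000, 3 with the default 100 000, and none with 300 000, at the price of a larger up-front allocation.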
IdsExportFilter configuration options available in URL:
- field - name of the field where Ids are stored. Default value: the one configured in defaultIndexField in solrconfig.xml. Example:
  http://localhost:8080/solr/core_name/select?q=*:*&fq={!ids field=product_id}
IdsExportSearchComponent configuration options available in solrconfig.xml:
- responseKey - key in the Solr response where Ids should be stored. Default value: ids.
IdsExportResponseWriter configuration options available in solrconfig.xml:
- responseKey - key in the Solr response where Ids are stored. The final Solr output will contain only the comma-separated values from this field. Default value: ids.
- separator - a separator (char or String) used to separate values in the final Solr output. This document assumes a comma, hence the phrase "comma-separated" used throughout, but it can be changed. Default value: , (comma)
Performance
Single query time comparison
In this test scenario, a single Solr instance was processing only a single request at once. Each request was sent three times to Solr:
- To the /select endpoint, with rows=0 and IdsExportPlugin enabled
- To the /export endpoint, with sorting set to Ids (sorting was obligatory)
- To the /select endpoint, with rows set to the expected size of the result set and sorting set to Ids, without IdsExportPlugin
Note: the given times are total request times, including sending the HTTP request, searching, and downloading the HTTP response. Technically, times were measured with the Linux time command, which measured the execution time of curl for a given query. This approach is not a "clean" benchmark of the plugin itself, because it also includes the overhead of downloading a potentially large response, which favors IdsExportPlugin due to its very concise output format. It is, however, the closest to the actual use cases of the plugin.
Results (times in seconds):
| numFound | IdsExportPlugin | /export | /select |
|---|---|---|---|
| 2 | 0.036 | 0.012 | 0.008 |
| 1082 | 0.012 | 0.128 | 0.136 |
| 12957 | 0.02 | 1.956 | 1.949 |
| 225816 | 0.149 | 55.105 | 59.068 |
| 1841320 | 0.681 | 393.532 | 396.918 |
| 5971685 | 2.232 | 831.853 | 822.736 |
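
The speedup factors implied by the table can be derived with a few lines of Python. The numbers are copied from the table above; the names `RESULTS` and `speedups` are just illustrative.

```python
# (numFound, IdsExportPlugin, /export, /select), times in seconds from the table.
RESULTS = [
    (2, 0.036, 0.012, 0.008),
    (1082, 0.012, 0.128, 0.136),
    (12957, 0.02, 1.956, 1.949),
    (225816, 0.149, 55.105, 59.068),
    (1841320, 0.681, 393.532, 396.918),
    (5971685, 2.232, 831.853, 822.736),
]


def speedups(rows):
    """Return {numFound: (export_time / plugin_time, select_time / plugin_time)}."""
    return {n: (export / plugin, select / plugin) for n, plugin, export, select in rows}


for num_found, (vs_export, vs_select) in speedups(RESULTS).items():
    print(f"{num_found:>8}: {vs_export:7.1f}x vs /export, {vs_select:7.1f}x vs /select")
```

For the smallest result set the ratio is below 1, meaning the plugin was slower in that single measurement; from a few hundred thousand documents upwards the ratios are in the hundreds.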
Multi-threaded performance
In this test scenario, a single Solr instance was processing requests incoming via multiple connections concurrently. Each request was sent to two endpoints:
- To the /select endpoint, with rows=0 and IdsExportPlugin enabled
- To the /export endpoint, with sorting set to Ids (sorting was obligatory)
The test scenario has been divided into three test cases. In each test case a set of unique phrases has been used, selected to give the expected number of results:
- between 20 000 and 50 000 (phrases giving "small" result sets)
- between 50 000 and 280 000 (phrases giving "medium sized" result sets)
- between 280 000 and 3 100 000 (phrases giving "large" result sets)
This test scenario was carried out using Apache JMeter. All results presented below come from JMeter results.
Results:
| concurrent connections | requests per connection | total request count | result set size per request | IdsExportPlugin RPS | IdsExportPlugin avg | IdsExportPlugin Max | /export RPS | /export avg | /export max |
|---|---|---|---|---|---|---|---|---|---|
| 30 | 80 | 2400 | 20000-50000 | 489.50 rps | 47.00 ms | 190.00 ms | 3.00 rps | 9414.00 ms | 26972.00 ms |
| 30 | 80 | 2400 | 50000-280000 | 199.10 rps | 127.00 ms | 325.00 ms | 0.80 rps | 35663.00 ms | 126313.00 ms |
| 30 | 22 | 660 | 280000-3100000 | 30.00 rps | 796.00 ms | 2669.00 ms | 0.10 rps | 230305.00 ms | 812294.00 ms |
Performance - summary
The presented results clearly show that IdsExportPlugin greatly speeds up Ids export from Solr: response time and throughput may be a couple of hundred times better than with Solr's built-in /export or /select endpoints.
The largest performance killer for /export and /select is result set sorting. IdsExportPlugin does not perform any sorting; it just outputs all found Ids in the order in which Solr processes them.
Memory consumption
Memory consumption of IdsExportPlugin is not higher than the memory consumption of the standard /export endpoint.
On the one hand, IdsExportPlugin requires a data structure whose size is proportional to the number of found documents, so the bigger the result set, the more memory is required for processing.
On the other hand, the standard /export endpoint also requires a data structure with a size proportional to the result set size, for sorting purposes. Therefore the overall memory footprint of IdsExportPlugin will not be higher than that of /export.
Pro tip:
Generally it's best to use IdsExportPlugin with fields that have DocValues of type Numeric or SortedNumeric - in this case the data structure is com.carrotsearch.hppc.LongArrayList, which internally relies on an array of primitive longs.
All other field types will store their Ids inside an ArrayList of org.apache.lucene.util.BytesRef - an optimized way of storing Strings.
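For numeric fields, the size of the buffer can be estimated with simple arithmetic, since a primitive long takes 8 bytes. The sketch below gives a lower bound only: the function name `long_buffer_bytes` is hypothetical, and the estimate ignores object headers and any spare capacity left after the buffer has grown. No comparable estimate is given for BytesRef-based fields, because the per-value overhead there depends on the values themselves.

```python
BYTES_PER_LONG = 8  # a primitive 64-bit long


def long_buffer_bytes(num_ids, buffer_initial_size=100_000):
    """Lower-bound size in bytes of the backing long array for a numeric field.

    The array holds at least max(num_ids, buffer_initial_size) slots; extra
    capacity left over after growth is ignored here.
    """
    return max(num_ids, buffer_initial_size) * BYTES_PER_LONG


for num_ids in (12_957, 225_816, 5_971_685):
    megabytes = long_buffer_bytes(num_ids) / 1_000_000
    print(f"{num_ids:>9} ids: at least {megabytes:.1f} MB")
```

With the default bufferInitialSize, the largest result set from the single-query test (5 971 685 Ids) needs at least about 47.8 MB per request under this estimate.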
Build
./gradlew clean build
License
This software is published under Apache License 2.0.