IdsExportPlugin
A plugin (to be more precise: a set of plugins) for Solr that allows time-efficient export of the Ids of all found documents (or the values of any DocValues-enabled field) in comma-separated format without sorting. Skipping result sorting gives significantly better performance than Solr's built-in /export endpoint.
Note: the plugin was developed and tested on a standalone Solr instance, with no promises or guarantees about Solr Cloud.
Requirements
- Solr version > 7.2 (tested with 7.2.1)
- Solr running in standalone mode (Solr Cloud not supported)
Motivation
The initial motivation for creating this plugin was the ability to produce output that could be used directly as input for the Terms Query Parser in another Solr request. Example:
First, search the Car Brands index for the IDs of all brands that sell in Poland:
http://localhost:8080/car_brands/select?q=availability:pl&fq={!ids field=brand_id}&wt=ids
Output: vw,opel,audi
Then, search the Car Models index for models of those brands with an electric engine:
http://localhost:8080/car_models/select?q=engine:electric&fq={!terms f=brand_id}vw,opel,audi
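A client could chain the two requests roughly as follows. This is a minimal Python sketch, not part of the plugin: the helper names are made up for illustration, the host and core names are the ones from the example above, and the code only builds the URLs (sending them, for instance with urllib.request.urlopen, is left out). Unlike the readable URLs above, the query parameters here are URL-encoded.

```python
from urllib.parse import urlencode


def ids_export_url(base_url, core, query, field):
    """Build a request that exports the values of `field` as a comma-separated list."""
    params = {
        "q": query,
        "fq": "{!ids field=%s}" % field,  # activates the post-filter registered as "ids"
        "wt": "ids",                      # selects the response writer registered as "ids"
    }
    return "%s/%s/select?%s" % (base_url, core, urlencode(params))


def terms_filter_url(base_url, core, query, field, exported_ids):
    """Build a follow-up request that feeds the exported text to the Terms Query Parser."""
    params = {
        "q": query,
        "fq": "{!terms f=%s}%s" % (field, exported_ids),
    }
    return "%s/%s/select?%s" % (base_url, core, urlencode(params))


first = ids_export_url("http://localhost:8080", "car_brands", "availability:pl", "brand_id")
# `exported` stands in for the body returned by the first request.
exported = "vw,opel,audi"
second = terms_filter_url(
    "http://localhost:8080", "car_models", "engine:electric", "brand_id", exported
)
```

Because the first response is already in the format the Terms Query Parser expects, the client passes it through without parsing it.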
Other possible use cases include:
- simplifying batch jobs that do some calculation on a full result set and don't require any document order (e.g. recalculating popularity for all products from Poland every day) - removes the need for paging
- creating reports - finding all documents matching criteria
- replacing the /export endpoint when sorting is not required
Basic concepts
IdsExportPlugin consists of:
- IdsExportFilter
- IdsExportSearchComponent
- IdsExportResponseWriter
The idea behind IdsExportPlugin is to use a post-filter (IdsExportFilter) as the last filter during the request processing phase, which will collect all found Document Ids in an optimized data structure. Then IdsExportSearchComponent will write those Ids to the response, and IdsExportResponseWriter will output them in comma-separated format.
IdsExportFilter
IdsExportFilter is a Solr post-filter. In Solr terminology, a filter is a piece of code that decides whether a document matches the search criteria and should be included in the response. A post-filter is executed after the regular filters, so it works on a limited set of documents that the previous filters have already narrowed down.
IdsExportFilter implements the post-filter interface, but it doesn't actually decide whether a document matches the search criteria: it accepts all documents. Instead, it collects the values of a certain field from the documents and stores them in a data structure. The field name is defined in the request URL or in the configuration.
This filter was initially designed to read the values of the documents' unique key, but in fact it can read the values of any field that has DocValues enabled. In this document we will refer to those values as Ids.
Internally, Ids are stored in a data structure:
- in case of fields with Numeric or Sorted Numeric DocValues, Ids (which are longs) are stored inside com.carrotsearch.hppc.LongArrayList (a data structure based on an array of primitive longs)
- in case of fields with Binary, Sorted or SortedSet DocValues, Ids (which are Strings) are stored as an ArrayList of org.apache.lucene.util.BytesRef (a Lucene-optimized type for binary values, mainly used for Strings)
Ids don't need to be unique - a repeated value is stored once for each occurrence.
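The behaviour described above can be modelled in a few lines. This is a toy Python model, not the plugin's Java code: the class and method names are invented, array('q') (a contiguous buffer of signed 64-bit integers) stands in for the primitive long array, and a plain list of bytes objects stands in for the list of BytesRef values.

```python
from array import array


class IdsCollector:
    """Toy model of an accept-everything post-filter that records one field's values."""

    def __init__(self, field, numeric):
        self.field = field
        self.numeric = numeric
        # Numeric Ids go into a compact 64-bit integer buffer,
        # all other Ids into a list of byte strings.
        self.ids = array("q") if numeric else []

    def collect(self, doc):
        """Record the field value(s) of `doc` and always accept the document."""
        value = doc[self.field]
        values = value if isinstance(value, list) else [value]  # multi-valued fields
        for v in values:
            self.ids.append(v if self.numeric else str(v).encode("utf-8"))
        return True  # never rejects: filtering was done by the earlier filters


docs = [{"product_id": 7}, {"product_id": 3}, {"product_id": 7}]
collector = IdsCollector("product_id", numeric=True)
accepted = [doc for doc in docs if collector.collect(doc)]
```

After the loop every document has been accepted and collector.ids holds 7, 3, 7: the values are kept in processing order and the repeated value appears twice.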
IdsExportSearchComponent
IdsExportSearchComponent is a search component (a piece of code which executes after request processing, but before the response is sent) which simply adds the collected Ids to the Solr response under the key defined in the configuration. After this operation, the response contains an additional list of the Ids of all found documents.
IdsExportResponseWriter
The last component, IdsExportResponseWriter, transforms the Solr response into a comma-separated list of Ids. All additional response elements are skipped. The MIME type of the response is set to text/plain and the encoding to UTF-8.
Note: using IdsExportResponseWriter is optional. If you don't need the comma-separated format and a standard Solr JSON/XML/etc. response is fine, you don't have to use IdsExportResponseWriter.
Installation
- Add the JAR file to Solr's classpath (https://lucene.apache.org/solr/guide/7_2/lib-directives-in-solrconfig.html)
- Add the following code to solrconfig.xml:
<queryParser name="ids" class="pl.allegro.search.solr.ids.filter.IdsExportFilterParserPlugin">
  <int name="bufferInitialSize">100000</int>
</queryParser>
<searchComponent name="ids" class="pl.allegro.search.solr.ids.searchcomponent.IdsExportSearchComponent">
  <str name="responseKey">ids</str>
</searchComponent>
<queryResponseWriter name="ids" class="pl.allegro.search.solr.ids.responsewriter.IdsExportResponseWriter">
  <str name="responseKey">ids</str>
</queryResponseWriter>
The exact meaning of the configuration parameters is described in Configuration.
Each of those components may be registered under any valid name.
- The name of the IdsExportFilterParserPlugin (which is a factory for IdsExportFilter) will be reflected in the Solr URL (you will use it in requests to activate the plugin), so give it a reasonable name. In this document we will assume the name ids:
http://localhost:8080/solr/core_name/select?q=*:*&fq={!ids field=product_id}
- We strongly recommend giving IdsExportSearchComponent the same name as IdsExportFilterParserPlugin, for simplicity.
- The name of the IdsExportResponseWriter will be reflected in the Solr URL (you will use it to change the output format), so give it a reasonable name. We recommend the same name as IdsExportFilterParserPlugin, for simplicity. In this document we will assume the name ids:
http://localhost:8080/solr/core_name/select?q=*:*&fq={!ids field=product_id}&wt=ids
Usage examples
http://localhost:8080/solr/core_name/select?q=*:*&fq={!ids field=product_id}&wt=ids
# Outputs the comma-separated values of the `product_id` field from all documents in the index.
Example response:
1,2,3,4,5,6
http://localhost:8080/solr/core_name/select?q=*:*&fq={!ids field=product_id}&rows=2
# Outputs the list of values of the `product_id` field as an additional attribute of the Solr response.
Example response:
{
  "responseHeader": {
    "status": 0,
    "QTime": 2,
    "params": {
      "q": "*:*",
      "fq": "{!ids field=product_id}",
      "rows": "2"
    }
  },
  "response": {
    "numFound": 6,
    "start": 0,
    "docs": [
      {
        "product_name": "Test 0",
        "product_id": "0"
      },
      {
        "product_name": "Test 1",
        "product_id": "1"
      }
    ]
  },
  "ids": [
    "0",
    "1",
    "2",
    "3",
    "4",
    "5",
    "6"
  ]
}
# Note: the ids list ignores the rows/start parameters - it always contains everything found.
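On the client side, both response shapes are easy to consume. The following Python sketch is illustrative only: the function names are invented, the sample JSON body is a shortened stand-in for a real response, and the default key ids and the comma separator are the defaults described in this document.

```python
import json


def parse_ids_text(body, separator=","):
    """Split a wt=ids response body into a list of Id strings."""
    body = body.strip()
    return body.split(separator) if body else []


def ids_from_json(body, response_key="ids"):
    """Return the paged documents and the full Id list from a JSON response."""
    parsed = json.loads(body)
    return parsed["response"]["docs"], parsed[response_key]


json_body = """
{"response": {"numFound": 3, "start": 0,
              "docs": [{"product_id": "0"}, {"product_id": "1"}]},
 "ids": ["0", "1", "2"]}
"""
docs, ids = ids_from_json(json_body)
```

Here docs has two entries because of paging, while ids still lists all three values. Splitting the plain-text form is only unambiguous when the Id values themselves never contain the separator.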
Configuration
IdsExportFilterParserPlugin configuration options available in solrconfig.xml:
- bufferInitialSize - initial size (in number of items) of the buffer for storing Ids. It should be a bit bigger than the estimated average response size. Generally any number will work, however:
  - if set too low, the buffer will be extended several times during request processing, resulting in increased CPU and memory consumption
  - if set too high, you will unnecessarily allocate a lot of memory
  Default value: 100 000.
- defaultIndexField - name of the field where Ids are stored. This can also be configured on a per-request basis via the URL parameter field; if the URL parameter is missing, the default configured here is used. Default value: doc_id.
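To get a feel for the bufferInitialSize trade-off, the sketch below counts how many times a growable buffer would have to be extended before it can hold a given number of Ids. It is written in Python for illustration, and the growth factor of 1.5 is an assumption made for this example, not a documented property of the buffer the plugin uses.

```python
def count_resizes(initial_size, item_count, growth_factor=1.5):
    """Count the extensions needed before `item_count` items fit in the buffer."""
    capacity = max(1, initial_size)
    resizes = 0
    while capacity < item_count:
        # Grow proportionally, but always by at least one slot.
        capacity = max(capacity + 1, int(capacity * growth_factor))
        resizes += 1
    return resizes


fits_default = count_resizes(100_000, 50_000)     # result smaller than the buffer
medium_default = count_resizes(100_000, 225_816)  # default size, medium result set
medium_small = count_resizes(1_000, 225_816)      # undersized buffer, same result set
```

Under this assumption a result set that fits in the initial buffer needs no extension, the default size needs a handful of extensions for a result set of 225 816 Ids, and a buffer of 1 000 items needs several times more. Each extension typically means allocating a larger array and copying the existing content, which is where the extra CPU and memory consumption comes from.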
IdsExportFilter configuration options available in URL:
- field - name of the field where Ids are stored. Default value: configured in defaultIndexField in solrconfig.xml.
http://localhost:8080/solr/core_name/select?q=*:*&fq={!ids field=product_id}
IdsExportSearchComponent configuration options available in solrconfig.xml:
- responseKey - key in the Solr response where Ids should be stored. Default value: ids.
IdsExportResponseWriter configuration options available in solrconfig.xml:
- responseKey - key in the Solr response where Ids are stored. The final Solr output will contain only the separated values from this field. Default value: ids.
- separator - a separator (char or String) used to separate values in the final Solr output. This document assumes it is a comma, hence the phrase "comma-separated" used throughout, but it can be changed. Default value: , (comma)
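The effect of the two writer options can be pictured with a small Python sketch. It only models the described behaviour (take the values under responseKey, join them with separator, drop everything else); the function name is invented, and returning an empty string for a missing key is a choice made for this example, not documented plugin behaviour.

```python
def render_ids(response, response_key="ids", separator=","):
    """Return only the values stored under `response_key`, joined by `separator`."""
    return separator.join(str(value) for value in response.get(response_key, []))


response = {
    "responseHeader": {"status": 0},
    "response": {"numFound": 3, "docs": []},
    "ids": ["0", "1", "2"],
}
comma_separated = render_ids(response)
semicolon_separated = render_ids(response, separator=";")
```

The response header and the document list do not appear in either output; only the values under the configured key do.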
Performance
Single query time comparison
In this test scenario, a single Solr instance was processing only a single request at a time. Each query was sent to Solr in three ways:
- To the /select endpoint, with rows=0 and IdsExportPlugin enabled
- To the /export endpoint, with sorting set to Ids (sorting was obligatory)
- To the /select endpoint, with rows set to the expected size of the result set and sorting set to Ids, without IdsExportPlugin
Note: the given times are total request times, including sending the HTTP request, searching, and downloading the HTTP response. Technically, times were measured using the Linux time command, which measured the execution time of curl with a given query. This approach is not a "clean" benchmark of the plugin itself: it also counts the overhead of downloading a potentially large response, which favors IdsExportPlugin because of its very concise output format. It is, however, the closest to the actual use cases of the plugin.
Results (times in seconds):
numFound | IdsExportPlugin | /export | /select |
---|---|---|---|
2 | 0.036 | 0.012 | 0.008 |
1082 | 0.012 | 0.128 | 0.136 |
12957 | 0.02 | 1.956 | 1.949 |
225816 | 0.149 | 55.105 | 59.068 |
1841320 | 0.681 | 393.532 | 396.918 |
5971685 | 2.232 | 831.853 | 822.736 |
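The ratios behind these numbers can be recomputed directly from the table. The Python snippet below is just arithmetic on the values above; the names RESULTS and speedups are made up for this example.

```python
# numFound -> (IdsExportPlugin, /export, /select) times in seconds, from the table above
RESULTS = {
    2: (0.036, 0.012, 0.008),
    1082: (0.012, 0.128, 0.136),
    12957: (0.02, 1.956, 1.949),
    225816: (0.149, 55.105, 59.068),
    1841320: (0.681, 393.532, 396.918),
    5971685: (2.232, 831.853, 822.736),
}


def speedups(results):
    """Map numFound to (/export time, /select time) divided by the plugin time."""
    return {
        num_found: (export / plugin, select / plugin)
        for num_found, (plugin, export, select) in results.items()
    }


for num_found, (vs_export, vs_select) in speedups(RESULTS).items():
    print(f"{num_found:>8}  /export: {vs_export:7.1f}x  /select: {vs_select:7.1f}x")
```

For the two-document result the ratio is below 1, so the plugin was slower there. At about a thousand documents it is roughly 10 times faster than /export, and for the three largest result sets the ratio lies between roughly 370 and 580.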
Multi-threaded performance
In this test scenario, a single Solr instance was processing requests arriving concurrently over multiple connections. Each request was sent to two endpoints:
- To the /select endpoint, with rows=0 and IdsExportPlugin enabled
- To the /export endpoint, with sorting set to Ids (sorting was obligatory)
The test scenario has been divided into three test cases. In each test case a set of unique phrases has been used, selected to give the expected number of results:
- between 20 000 and 50 000 (phrases giving "small" result sets)
- between 50 000 and 280 000 (phrases giving "medium sized" result sets)
- between 280 000 and 3 100 000 (phrases giving "large" result sets)
This test scenario was carried out using Apache JMeter. All results presented below come from JMeter results.
Results:
concurrent connections | requests per connection | total request count | result set size per request | IdsExportPlugin RPS | IdsExportPlugin avg | IdsExportPlugin max | /export RPS | /export avg | /export max |
---|---|---|---|---|---|---|---|---|---|
30 | 80 | 2400 | 20000-50000 | 489.50 rps | 47.00 ms | 190.00 ms | 3.00 rps | 9414.00 ms | 26972.00 ms |
30 | 80 | 2400 | 50000-280000 | 199.10 rps | 127.00 ms | 325.00 ms | 0.80 rps | 35663.00 ms | 126313.00 ms |
30 | 22 | 660 | 280000-3100000 | 30.00 rps | 796.00 ms | 2669.00 ms | 0.10 rps | 230305.00 ms | 812294.00 ms |
Performance - summary
The presented results clearly show that IdsExportPlugin greatly speeds up Ids export from Solr - for non-trivial result sets, response time and throughput may be a few hundred times better than with Solr's built-in /export or /select endpoints.
The largest performance killer in /export and /select is result set sorting. IdsExportPlugin does not perform any sorting; it just outputs all found Ids in the order they are processed by Solr.
Memory consumption
The memory consumption of IdsExportPlugin is not higher than that of the standard /export endpoint.
On the one hand, IdsExportPlugin requires a data structure whose size is proportional to the number of found documents, so the bigger the result set, the more memory is required for processing.
On the other hand, the standard /export endpoint also requires a data structure whose size is proportional to the result set size, for sorting purposes. Therefore the overall memory footprint of IdsExportPlugin will not be higher than that of /export.
Pro tip:
Generally it's best to use IdsExportPlugin with fields that have DocValues of type Numeric or SortedNumeric - in this case the data structure is com.carrotsearch.hppc.LongArrayList, which internally relies on an array of primitive longs.
All other field types store their Ids inside an ArrayList of org.apache.lucene.util.BytesRef - an optimized way of storing Strings.
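A rough lower bound for the numeric case follows from the size of a long (8 bytes). The Python snippet below is only a back-of-the-envelope estimate: it ignores spare buffer capacity and any JVM overhead, and the function name is invented for this example.

```python
BYTES_PER_LONG = 8  # a 64-bit long


def long_buffer_bytes(id_count):
    """Lower bound on the memory needed to hold `id_count` primitive longs."""
    return id_count * BYTES_PER_LONG


largest = long_buffer_bytes(5_971_685)       # largest result set in the single-query test
default_buffer = long_buffer_bytes(100_000)  # default bufferInitialSize
print(f"{largest / 2**20:.1f} MiB, {default_buffer / 2**10:.0f} KiB")
```

That is about 45.6 MiB for the largest result set measured above and about 781 KiB for the default initial buffer. String Ids held as individual objects carry per-object overhead on top of their bytes, which is why numeric fields are the cheaper choice.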
Build
./gradlew clean build
License
This software is published under Apache License 2.0.