[Summary] Solr Search Service

 
1, Solr is a standalone enterprise search server with a REST-like API. You put documents in it (called "indexing") via JSON, XML, CSV or binary over HTTP. You query it via HTTP GET and receive JSON, XML, CSV or binary results.
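
To make the index/query cycle concrete, here is a minimal SolrJ sketch of both operations over HTTP. The core name techproducts, the field names, and the localhost URL are assumptions for illustration, not something the original notes specify.

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrInputDocument;

public class SolrHello {
    public static void main(String[] args) throws Exception {
        // Assumed core URL; adjust to your deployment.
        SolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/techproducts").build();

        // "Indexing": send a document over HTTP.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        doc.addField("title", "Hello Solr");
        client.add(doc);
        client.commit();

        // Querying: an HTTP GET under the hood.
        QueryResponse rsp = client.query(new SolrQuery("title:hello"));
        rsp.getResults().forEach(d -> System.out.println(d.getFieldValue("id")));
        client.close();
    }
}
```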
2, Solr Administration User Interface
  • Logging
  • Cloud Screens
  • Core Admin
  • Java Properties
  • Thread Dump
  • Core-Specific Tools
    • Analysis Screen
    • Dataimport Screen
    • Documents Screen
    • Files Screen
    • Ping
    • Plugin & Stats Screen
    • Query Screen
    • Replication Screen
    • Schema Browser Screen
    • Segments Info
3, Documents, Fields, Schema Design
  • Field Properties
    • indexed
    • stored
    • docValues
    • sortMissingFirst / sortMissingLast
    • multiValued
    • omitNorms
    • omitTermFreqAndPositions
    • omitPositions
    • termVectors / termPositions / termOffsets / termPayloads
    • required
  • Field Types
    • BinaryField
    • BoolField
    • CollationField
    • CurrencyField
    • DateRangeField
    • ExternalFileField
    • EnumField
    • LatLonType
    • PointType
    • TextField
    • StrField
    • TrieField
    • TrieInt/Long/FloatField
    • UUIDField
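
As a sketch of how the properties and types above combine in a concrete field definition, the following uses SolrJ's Schema API (available when the core runs a managed schema) to add a field. The field name author, the core URL, and the assumption that the core defines a "string" type (solr.StrField) are illustrative.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.schema.SchemaRequest;

public class AddFieldSketch {
    public static void main(String[] args) throws Exception {
        SolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/techproducts").build();

        // A field definition combining several of the properties above.
        Map<String, Object> field = new LinkedHashMap<>();
        field.put("name", "author");
        field.put("type", "string");     // assumed to map to solr.StrField
        field.put("indexed", true);      // searchable
        field.put("stored", true);       // retrievable in results
        field.put("multiValued", true);  // may hold several values per document

        new SchemaRequest.AddField(field).process(client);
        client.close();
    }
}
```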
4, Analyzers, Tokenizers and Filters
  • Analyzers
    • An analyzer examines the text of fields and generates a token stream
    • Analyzers are specified as a child of the <fieldType> element in the schema.xml configuration file
    • Analyzers
      • WhitespaceAnalyzer
      • SimpleAnalyzer
      • StopAnalyzer
      • StandardAnalyzer
  • Tokenizers
    • The job of a tokenizer is to break up a stream of text into tokens, where each token is (usually) a sub-sequence of the characters in the text
    • An analyzer is aware of the field it is configured for, but a tokenizer is not
    • Tokenizers read from a character stream (a Reader) and produce a sequence of Token objects (a TokenStream)
    • You configure the tokenizer for a text field type in schema.xml with a <tokenizer> element, as a child of <analyzer>
    • Tokenizers
      • WhitespaceTokenizer
      • KeywordTokenizer
      • LetterTokenizer
      • StandardTokenizer
  • Filters
    • Like tokenizers, filters consume input and produce a stream of tokens
    • Filters also derive from org.apache.lucene.analysis.TokenStream
    • Unlike tokenizers, a filter's input is another TokenStream. The job of a filter is usually easier than that of a tokenizer since in most cases a filter looks at each token in the stream sequentially and decides whether to pass it along, replace it or discard it
    • A filter may also do more complex analysis by looking ahead to consider multiple tokens at once, although this is less common
    • One hypothetical use for such a filter might be to normalize state names that would be tokenized as two words. For example, the single token "california" would be replaced with "CA", while the token pair "rhode" followed by "island" would become the single token "RI"
    • Filters
      • LowerCaseFilter
      • StopFilter
      • PorterStemFilter
      • ASCIIFoldingFilter
      • StandardFilter
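
The tokenizer-plus-filters chain described above can also be assembled programmatically with Lucene's CustomAnalyzer, mirroring what a schema.xml <analyzer> element declares. A minimal sketch; the field name and sample text are invented:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalysisChainSketch {
    public static void main(String[] args) throws Exception {
        // Equivalent in spirit to:
        //   <analyzer>
        //     <tokenizer class="solr.StandardTokenizerFactory"/>
        //     <filter class="solr.LowerCaseFilterFactory"/>
        //     <filter class="solr.StopFilterFactory"/>
        //   </analyzer>
        Analyzer analyzer = CustomAnalyzer.builder()
                .withTokenizer("standard")
                .addTokenFilter("lowercase")
                .addTokenFilter("stop")
                .build();

        try (TokenStream ts = analyzer.tokenStream("title", "The Quick Brown Fox")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term); // prints: quick, brown, fox
            }
            ts.end();
        }
    }
}
```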
5, Indexing
  • The three most common ways of loading data into a Solr index
    • Using the Solr Cell framework built on Apache Tika for ingesting binary files or structured files such as Office, Word, PDF, and other proprietary formats
    • Uploading XML files by sending HTTP requests to the Solr server from any environment where such requests can be generated
    • Writing a custom Java application to ingest data through Solr's Java Client API
  • Uploading Data with Index Handlers
    • Index Handlers are Request Handlers designed to add, delete, and update documents in the index
    • In addition to having plugins for importing rich documents using Tika, or from structured data sources using the Data Import Handler, Solr natively supports indexing structured documents in XML, CSV and JSON
  • Uploading Data with Solr Cell using Apache Tika
    • Solr uses code from the Apache Tika project to provide a framework for incorporating many different file-format parsers such as Apache PDFBox and Apache POI into Solr itself
    • Working with this framework, Solr's ExtractingRequestHandler can use Tika to support uploading binary files, including files in popular formats such as Word and PDF, for data extraction and indexing
    • When this framework was under development, it was called the Solr Content Extraction Library or CEL; from that abbreviation came this framework's name: Solr Cell
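
A hedged SolrJ sketch of uploading a binary file through Solr Cell's /update/extract handler; the file name, the literal.id value, and the fmap.content target field are assumptions for illustration:

```java
import java.io.File;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class SolrCellSketch {
    public static void main(String[] args) throws Exception {
        SolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/techproducts").build();

        // Send a PDF to the ExtractingRequestHandler; Tika parses it server-side.
        ContentStreamUpdateRequest req =
                new ContentStreamUpdateRequest("/update/extract");
        req.addFile(new File("report.pdf"), "application/pdf");
        req.setParam("literal.id", "report-1");   // supply the unique key
        req.setParam("fmap.content", "text");     // map extracted body to "text"
        req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);

        client.request(req);
        client.close();
    }
}
```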
  • Uploading Structured Data Store Data with the Data Import Handler
    • The Data Import Handler (DIH) provides a mechanism for importing content from a data store and indexing it
    • In addition to relational databases, DIH can index content from HTTP-based data sources such as RSS and ATOM feeds, e-mail repositories, and structured XML, where an XPath processor is used to generate fields
  • Detecting Languages During Indexing
    • Solr can identify languages and map text to language-specific fields during indexing using the langid UpdateRequestProcessor. Solr supports two implementations of this feature
      • Tika's language detection feature
      • LangDetect language detection
  • UIMA Integration
    • UIMA (the Apache Unstructured Information Management Architecture) lets you define custom pipelines of Analysis Engines that incrementally add metadata to your documents as annotations
6, Searching
  • The search query is processed by a request handler
    • Solr supports a variety of request handlers. Some are designed for processing search queries, while others manage tasks such as index replication
    • To process a search query, a request handler calls a query parser, which interprets the terms and parameters of a query
    • Input to a query parser can include
      • search strings, that is, terms to search for in the index
      • parameters for fine-tuning the query by increasing the importance of particular strings or fields, by applying Boolean logic among the search terms, or by excluding content from the search results
      • parameters for controlling the presentation of the query response, such as specifying the order in which results are to be presented or limiting the response to particular fields of the search application's schema
    • Search parameters may also specify a query filter
  • Query Syntax and Parsing
    • The Standard Query Parser
      • Solr's default Query Parser is also known as the "lucene" parser
      • The key advantage of the standard query parser is that it supports a robust and fairly intuitive syntax allowing you to create a variety of structured queries
      • The largest disadvantage is that it's very intolerant of syntax errors, as compared with something like the DisMax query parser which is designed to throw as few errors as possible
    • The DisMax Query Parser
      • The DisMax query parser is designed to process simple phrases (without complex syntax) entered by users and to search for individual terms across several fields using different weighting (boosts) based on the significance of each field
      • Additional options enable users to influence the score based on rules specific to each use case, independent of user input
    • The Extended DisMax Query Parser
      • The Extended DisMax (eDisMax) query parser is an improved version of the DisMax query parser. In addition to supporting all the DisMax query parser parameters, it supports the full Lucene query syntax; a query sketch follows the parser list below
    • Other Parsers
      • Block Join Query Parsers
      • Boost Query Parser
      • Collapsing Query Parser
      • Complex Phrase Query Parser
      • Field Query Parser
      • Function Query Parser
      • Function Range Query Parser
      • Join Query Parser
      • Lucene Query Parser
      • Max Score Query Parser
      • More Like This Query Parser
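
To illustrate the DisMax/eDisMax style of query described above (plain user input searched across several weighted fields), here is a SolrJ sketch; the field names in qf and the mm value are illustrative assumptions:

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class EdismaxSketch {
    public static void main(String[] args) throws Exception {
        SolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/techproducts").build();

        SolrQuery q = new SolrQuery("memory card");   // plain phrase, no Lucene syntax
        q.set("defType", "edismax");                  // choose the eDisMax parser
        q.set("qf", "title^5 description^2 body");    // weighted fields (assumed names)
        q.set("mm", "75%");                           // at least 75% of terms must match

        QueryResponse rsp = client.query(q);
        System.out.println("hits: " + rsp.getResults().getNumFound());
        client.close();
    }
}
```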
  • Query
    • TermQuery
    • TermRangeQuery
    • NumericRangeQuery
    • PrefixQuery
    • BooleanQuery
    • PhraseQuery
    • WildcardQuery
    • FuzzyQuery
    • MatchAllDocsQuery
  • Faceting
    • Faceting is the arrangement of search results into categories based on indexed terms
    • Searchers are presented with the indexed terms, along with numerical counts of how many matching documents were found for each term
    • Faceting makes it easy for users to explore search results, narrowing in on exactly the results they are looking for
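
A minimal SolrJ faceting sketch, assuming an indexed field named category:

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FacetSketch {
    public static void main(String[] args) throws Exception {
        SolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/techproducts").build();

        SolrQuery q = new SolrQuery("*:*");
        q.setFacet(true);
        q.addFacetField("category");   // count matching documents per indexed term
        q.setFacetMinCount(1);         // hide terms with zero matches

        QueryResponse rsp = client.query(q);
        FacetField categories = rsp.getFacetField("category");
        for (FacetField.Count c : categories.getValues()) {
            System.out.println(c.getName() + ": " + c.getCount());
        }
        client.close();
    }
}
```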
  • Highlighting
    • Highlighting in Solr allows fragments of documents that match the user's query to be included with the query response
    • There are three highlighting implementations available
      • The Standard Highlighter is the swiss-army knife of the highlighters. It has the most sophisticated and fine-grained query representation of the three highlighters
      • FastVector Highlighter
        • The FastVector Highlighter requires term vector options (termVectors, termPositions, and termOffsets) on the field, and is optimized with that in mind
        • It tends to work better for more languages than the Standard Highlighter because it supports Unicode break iterators. On the other hand, its query representation is less advanced than the Standard Highlighter's; for example, it will not work well with the surround parser
        • This highlighter is a good choice for large documents and highlighting text in a variety of languages
      • Postings Highlighter
        • The Postings Highlighter requires storeOffsetsWithPositions to be configured on the field
        • This is a much more compact and efficient structure than term vectors, but is not appropriate for huge numbers of query terms (e.g., wildcard queries)
        • Like the FastVector Highlighter, it supports Unicode algorithms for dividing up the document
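
A short SolrJ highlighting sketch using the default (Standard) highlighter; the features field is an assumption and must be stored for fragments to be returned:

```java
import java.util.List;
import java.util.Map;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class HighlightSketch {
    public static void main(String[] args) throws Exception {
        SolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/techproducts").build();

        SolrQuery q = new SolrQuery("features:memory");
        q.setHighlight(true);
        q.addHighlightField("features");    // assumed stored field
        q.setHighlightSimplePre("<em>");
        q.setHighlightSimplePost("</em>");

        QueryResponse rsp = client.query(q);
        // document id -> field -> highlighted fragments
        Map<String, Map<String, List<String>>> hl = rsp.getHighlighting();
        hl.forEach((id, fields) -> System.out.println(id + " -> " + fields));
        client.close();
    }
}
```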
  • Spell Checking
  • Query Re-Ranking
  • Suggester
  • MoreLikeThis
  • Pagination of Results
  • Result Grouping
  • Spatial Search
  • The Term Vector Component: For each document in the response, the TermVectorComponent can return the term vector, the term frequency, inverse document frequency, position, and offset information
  • The Stats Component: The Stats component returns simple statistics for numeric, string, and date fields within the document set
  • Response Writers
    • CSVResponseWriter
    • JSONResponseWriter
    • VelocityResponseWriter
    • XMLResponseWriter
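
Response writers are selected with the wt request parameter. From SolrJ, one way to see a writer's raw output is NoOpResponseParser, which passes the response body through unparsed; a sketch, with the core name and query assumed:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.impl.NoOpResponseParser;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.util.NamedList;

public class ResponseWriterSketch {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/techproducts").build();

        QueryRequest req = new QueryRequest(new SolrQuery("*:*"));
        // Ask for JSONResponseWriter output instead of SolrJ's usual javabin.
        req.setResponseParser(new NoOpResponseParser("json"));

        NamedList<Object> result = client.request(req);
        System.out.println(result.get("response"));   // the raw JSON body
        client.close();
    }
}
```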
7, The Well-Configured Solr Instance
  • Configuring solrconfig.xml
    • request handlers, which process the requests to Solr, such as requests to add documents to the index or requests to return results for a query
    • listeners, processes that "listen" for particular query-related events; listeners can be used to trigger the execution of special code, such as invoking some common queries to warm up caches
    • the Request Dispatcher for managing HTTP communications
    • the Admin Web interface
    • parameters related to replication and duplication (these parameters are covered in detail in Legacy Scaling and Distribution)
  • Solr Cores and solr.xml
    • solr.xml has evolved from configuring a single Solr core to supporting multiple Solr cores and, ultimately, to defining parameters for SolrCloud
8, SolrCloud
  • Concepts
    • Collection: a complete logical index in a SolrCloud cluster. It is usually divided into one or more Shards, all of which use the same Config Set. If there is more than one Shard, the index is distributed; SolrCloud lets you refer to it by Collection name, without worrying about the shard-related parameters that distributed search would otherwise require
    • Config Set: the set of configuration files a Solr Core needs in order to provide service. Each config set has a name. At a minimum it includes solrconfig.xml (SolrConfigXml) and schema.xml (SchemaXml); depending on what those two files configure, other files may be required as well. Config sets are stored in ZooKeeper; they can be re-uploaded or updated with the upconfig command, and the Solr startup parameter bootstrap_confdir can be used to initialize or update them
    • Core: a Solr Core. A Solr instance contains one or more Solr Cores, each of which independently provides indexing and query capability. Each Solr Core corresponds to one index, or to one Shard of a Collection. Cores were introduced to increase administrative flexibility and resource sharing. The difference in SolrCloud is that a core's configuration lives in ZooKeeper, whereas a traditional Solr core keeps its configuration files in a directory on disk
    • Leader: the shard replica that wins an election. Each Shard has several Replicas, which hold an election to determine a Leader. An election can take place at any time, but it is normally triggered only when a Solr instance fails. When documents are indexed, SolrCloud forwards them to the Leader of the corresponding Shard, and the Leader distributes them to all of that Shard's Replicas
    • Replica: one copy of a Shard. Each Replica lives in one Solr Core. For example, a collection named "test" created with numShards=1 and replicationFactor=2 produces two replicas, i.e., two Cores, each on a different machine or Solr instance. One will be named test_shard1_replica1 and the other test_shard1_replica2; one of them will be elected Leader
    • Shard: a logical slice of a Collection. Each Shard is backed by one or more Replicas, and an election determines which of them is the Leader
    • Zookeeper: ZooKeeper provides the distributed locking that SolrCloud requires and handles Leader elections. Solr can run with an embedded ZooKeeper, but a standalone ensemble is recommended, preferably with three or more hosts
  • Features
    • Central configuration for the entire cluster
    • Automatic load balancing and fail-over for queries
    • ZooKeeper integration for cluster coordination and configuration
    • Nodes and Cores
      • In SolrCloud, a node is a Java Virtual Machine instance running Solr, commonly called a server. Each Solr core can also be considered a node. Any node can contain both an instance of Solr and various kinds of data
      • A Solr core is basically an index of the text and fields found in documents
      • A single Solr instance can contain multiple "cores", which are separate from each other based on local criteria
      • When you start a new core in SolrCloud mode, it registers itself with ZooKeeper. This involves creating an Ephemeral node that will go away if the Solr instance goes down, as well as registering information about the core and how to contact it
    • Clusters
      • A cluster is a set of Solr nodes managed by ZooKeeper as a single unit
      • When you have a cluster, you can always make requests to the cluster and if the request is acknowledged, you can be sure that it will be managed as a unit and be durable, i.e., you won't lose data. Updates can be seen right after they are made and the cluster can be expanded or contracted
    • Leaders and Replicas
      • The concept of a leader is similar to that of a master in traditional Solr replication. The leader is responsible for making sure the replicas are up to date with the same information stored in the leader
      • However, with SolrCloud, you don't simply have one master and one or more "slaves", instead you likely have distributed your search and index traffic to multiple machines
    • When your data is too large for one node, you can break it up and store it in sections by creating one or more shards. Each is a portion of the logical index, or core, and it's the set of all nodes containing that section of the index
    • A shard is a way of splitting a core over a number of "servers", or nodes. For example, you might have a shard for data that represents each state, or different categories that are likely to be searched independently, but are often combined
    • Before SolrCloud, Solr supported Distributed Search, which allowed one query to be executed across multiple shards, so the query was executed against the entire Solr index and no documents would be missed from the search results
    • ZooKeeper provides failover and load balancing
    • One of the advantages of using SolrCloud is the ability to distribute requests among various shards that may or may not contain the data that you're looking for. You have the option of searching over all of your data or just parts of it
    • Configuring the ShardHandlerFactory
      • You can directly configure aspects of the concurrency and thread-pooling used within distributed search in Solr. This allows for finer grained control and you can tune it to target your own specific requirements. The default configuration favors throughput over latency
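
A minimal CloudSolrClient sketch in the SolrJ 6.x builder style; the ZooKeeper ensemble address and the collection name "test" are assumptions:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class CloudSketch {
    public static void main(String[] args) throws Exception {
        // Connect through the ZooKeeper ensemble rather than a fixed Solr URL;
        // the client discovers live nodes and routes updates to shard leaders.
        CloudSolrClient client = new CloudSolrClient.Builder()
                .withZkHost("zk1:2181,zk2:2181,zk3:2181")
                .build();
        client.setDefaultCollection("test");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "cloud-1");
        client.add(doc);
        client.commit();

        System.out.println(
                client.query(new SolrQuery("*:*")).getResults().getNumFound());
        client.close();
    }
}
```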
9, Chinese Word Segmenters
  • mmseg4j
    • mmseg4j is a Chinese word segmenter implementing Chih-Hao Tsai's MMSeg algorithm
    • The MMSeg algorithm offers two segmentation methods, Simple and Complex, both based on forward maximum matching; Complex adds four filtering rules
  • paoding
    • Paoding's Knives is a Chinese segmenter with very high efficiency and extensibility. It introduces metaphors into its design, adopts a fully object-oriented approach, and is architecturally forward-looking
    • High efficiency: on a PIII machine with 1 GB of memory, it can accurately segment one million Chinese characters per second
    • It segments text against an unlimited number of dictionary files, which also allows vocabulary to be classified and defined
    • It can reasonably analyze unknown words
  • ictclas4j
    • The ictclas4j Chinese word segmentation system is an open-source Java segmentation project completed by sinboy on the basis of FreeICTCLAS, developed by Zhang Huaping and Liu Qun at the Chinese Academy of Sciences
  • IKAnalyzer
    • A Chinese word segmentation component built around the open-source Lucene project, combining dictionary-based segmentation with grammar analysis algorithms
    • It uses a distinctive "forward iterative finest-granularity segmentation algorithm" and can process 600,000 characters per second
    • It uses a multi-subprocessor analysis model, supporting segmentation of English letters (IP addresses, email addresses, URLs), numbers (dates, common Chinese quantifiers, Roman numerals, scientific notation), and Chinese vocabulary (personal and place names)
    • Its handling of mixed Chinese-English text is not very good and is relatively cumbersome, requiring a second query; on the other hand, it supports dictionary storage optimized for user-defined entries, with a smaller memory footprint
    • Supports user-defined dictionary extensions
    • IKQueryParser, a query analyzer optimized for Lucene full-text search, uses an ambiguity-analysis algorithm to optimize the arrangement and combination of query keywords, which can greatly improve Lucene's hit rate
  • ansj
    • A Java implementation of ICTCLAS. Essentially all data structures and algorithms were rewritten; the dictionary is the one provided by the open-source edition of ICTCLAS, with some manual optimization
    • In-memory Chinese segmentation runs at roughly one million characters per second (faster than ICTCLAS)
    • Segmentation while reading from files runs at roughly 300,000 characters per second
    • Accuracy can exceed 96%
    • Currently implements Chinese word segmentation, Chinese name recognition, and user-defined dictionaries
    • It can be applied to natural language processing and similar tasks, and is suitable for projects with high demands on segmentation quality
10, Solr Performance Factors
  • Schema Design Considerations
  • indexed fields
  • Configuration Considerations
  • mergeFactor