Implementing an HBase Secondary Index with Solr

Time: 2018-11-08 02:40:36

Why HBase needs a secondary index

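HBase data is usually retrieved in one of three ways:

1. get: fetch a single record by specifying its RowKey
2. scan: match a range by setting a start row and a stop row
3. full table scan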

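So the only way to locate a specific record in an HBase table both precisely and quickly is to query by RowKey. In most cases, however, data has to be queried by several conditions, and a lookup on a single RowKey no longer meets the need.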

Approach

HBase -> Key-Value Store Indexer -> Solr -> real-time query and display in the web front end

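Solr is a high-performance full-text search server written in Java and built on Lucene. It extends Lucene with a richer query language, is configurable and extensible, is optimized for query performance, and ships with a full-featured administration interface.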

Key-Value Store Indexer is the intermediate tool that builds Solr indexes from HBase data. In CDH5, Key-Value Store Indexer is provided by the Lily HBase NRT Indexer service.

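Lily HBase Indexer is a flexible, scalable, fault-tolerant, transactional, near-real-time distributed service for indexing HBase column data. It is part of the Lily system developed by NGDATA and has been open-sourced. Lily HBase Indexer stores the HBase index data in SolrCloud. When HBase performs a write, update, or delete, the Indexer uses HBase replication to turn the operation into a series of events, which it uses to keep the index data written to Solr consistent with HBase. The Indexer also supports user-defined extraction and transformation rules for indexing HBase column data. Solr search results contain the user-defined columnfamily:qualifier fields, so applications can access the HBase column data directly. Indexing and searching do not affect HBase stability or write throughput, because both run completely separately from HBase and asynchronously. In CDH5, Lily HBase Indexer depends on the HBase, SolrCloud, and ZooKeeper services.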

Deployment steps

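First create a test table with replication enabled on its column family. Replication is the HBase feature that copies data between clusters, and the Indexer relies on it to receive change events.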

create 'table', {NAME => 'test', REPLICATION_SCOPE => 1}
# 1 enables replication; the default is 0

For a table that already exists:

disable 'table'
alter 'table', {NAME => 'test', REPLICATION_SCOPE => 1}
enable 'table'
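When several tables need the same change, the three statements above can be generated and piped into the HBase shell. The following is a minimal sketch, assuming a POSIX-style shell; replication_script is a hypothetical helper name, and the usage line relies on the HBase shell's non-interactive mode (hbase shell -n).

```shell
#!/usr/bin/env bash
# Print the HBase shell statements that turn on replication for one
# column family of an existing table. Nothing is executed here.
# replication_script is a hypothetical helper name, not part of HBase.
replication_script() {
  local table="$1" family="$2"
  printf "disable '%s'\n" "$table"
  printf "alter '%s', {NAME => '%s', REPLICATION_SCOPE => 1}\n" "$table" "$family"
  printf "enable '%s'\n" "$table"
}

# Usage on a host with the HBase client installed (not run here):
#   replication_script table test | hbase shell -n
```

Printing the statements first makes it easy to review them before anything touches the cluster.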

Next, on a host where Solr is installed, generate the instance configuration files with the solrctl command that ships with CDH.

solrctl instancedir --generate /opt/testindex/test
# the path is up to you

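In the generated directory /opt/testindex/test/, edit the autoCommit section of conf/solrconfig.xml. This setting controls hard commits, so it has a small performance cost.

<autoCommit>
  <maxTime>${solr.autoCommit.maxTime:60000}</maxTime>
  <openSearcher>true</openSearcher>
</autoCommit>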

Add the field to the conf/managed-schema file.

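*Note: this step can be skipped. It is better to add the field in the web admin UI after the collection has been created; otherwise Solr has to be restarted, or the field may be lost.

<field name="testId" type="string" indexed="true" stored="true" />
# name: the custom index field name; it must match the outputField attribute in the Morphline file configured later
# type: the field type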

Upload the configuration to ZooKeeper.

solrctl instancedir --create test /opt/testindex/test
# --create is followed by a name of your choice
# the path is the instancedir generated above

Create the collection.

solrctl collection --create test
# the name after --create must match the name used in the previous step

Once created, the collection appears in the Solr web admin panel (port 8983, path /solr), where you can now open its schema and add the field.

Create the Lily HBase Indexer configuration

In the instancedir path defined earlier, /opt/testindex/test, create the file morphline-hbase-mapper.xml.

<?xml version="1.0" encoding="UTF-8"?>
<indexer table="table" mapper="com.ngdata.hbaseindexer.morphline.MorphlineResultToSolrMapper">
  <param name="morphlineFile" value="morphlines.conf"></param>
  <param name="morphlineId" value="test"></param>
</indexer>

In the CM admin UI, open Key-Value Store Indexer and edit the Morphline file.

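Notes:

id: must be identical to the value of morphlineId in the morphline-hbase-mapper.xml file created above.

inputColumn: the HBase column to write to Solr. The value consists of the column family and the column qualifier, separated by ':'. The qualifier may be the wildcard *; for example, data:* reads every HBase column whose column family is data.

outputField: the name of the field that the morphline writes each record to. It must match the name of a field node in Solr's managed-schema file, or a field added to the collection in the admin UI; otherwise the data is not written correctly.

type: the data type of the value read from HBase. HBase stores everything as byte[], while Solr indexes all content as text, so something has to convert the byte[] into the actual data type; that is what the type parameter does. Supported types are byte, int, long, string, boolean, float, double, short, and bigdecimal. You can also supply a custom type by implementing the com.ngdata.hbaseindexer.parse.ByteArrayValueMapper interface.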
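To see why type matters: HBase's Bytes.toBytes(int) stores an int as four big-endian bytes, so the value 1001 is held as the bytes 00 00 03 e9 rather than as the characters "1001". The sketch below only illustrates that byte layout; it is not the Indexer's own code, and be_hex_to_int is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Decode four big-endian bytes (given as two-digit hex strings) into a
# decimal integer, the layout produced by HBase's Bytes.toBytes(int).
# Handles non-negative values only; be_hex_to_int is a hypothetical name.
be_hex_to_int() {
  printf '%d\n' "0x$1$2$3$4"
}

be_hex_to_int 00 00 03 e9   # prints 1001
```

A column written this way needs type : int in the mapping. A column written as a string, such as the test:test_id column used in this article, needs type : string.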

SOLR_LOCATOR : {
  # Name of solr collection
  collection : hbaseindexer
  # ZooKeeper ensemble
  zkHost : "$ZK_HOST"
}

morphlines : [
  {
    id : test
    importCommands : ["org.kitesdk.**", "com.ngdata.**"]
    commands : [
      {
        extractHBaseCells {
          mappings : [
            {
              inputColumn : "test:test_id"
              outputField : "testId"
              type : string
              source : value
            }
          ]
        }
      }
      { logDebug { format : "output record: {}", args : ["@{}"] } }
    ]
  }
]
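Because the morphline id has to match the morphlineId in morphline-hbase-mapper.xml, a quick check before restarting the service can catch a mismatch. This is a minimal sketch, assuming you keep a local copy of the Morphline file, that its id entry sits on its own line, and that the XML follows the layout used in this article. All three function names are hypothetical.

```shell
#!/usr/bin/env bash
# Extract the morphlineId value from morphline-hbase-mapper.xml.
mapper_morphline_id() {
  sed -n 's/.*name="morphlineId"[[:space:]]*value="\([^"]*\)".*/\1/p' "$1"
}

# Extract the first "id : xxx" entry from a local copy of the Morphline file.
# Assumes the id entry starts its own line, optionally quoted.
morphline_conf_id() {
  sed -n 's/^[[:space:]]*id[[:space:]]*:[[:space:]]*"\{0,1\}\([A-Za-z0-9_.-]*\).*/\1/p' "$1" | head -n 1
}

# Succeed only when both files name the same, non-empty morphline id.
check_morphline_id() {
  local a b
  a="$(mapper_morphline_id "$1")"
  b="$(morphline_conf_id "$2")"
  if [ -n "$a" ] && [ "$a" = "$b" ]; then
    echo "OK: morphline id '$a' matches"
  else
    echo "MISMATCH: mapper has '$a', Morphline file has '$b'" >&2
    return 1
  fi
}

# Usage (paths are examples):
#   check_morphline_id /opt/testindex/test/morphline-hbase-mapper.xml ./morphlines.conf
```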

Save, then restart the Key-Value Store Indexer service.

Finally, register the Lily HBase Indexer configuration with the Lily HBase Indexer service.

hbase-indexer add-indexer \
  --name indexer \
  --indexer-conf /opt/testindex/test/morphline-hbase-mapper.xml \
  --connection-param solr.zk=node-01:2181,node-03:2181,node-05:2181/solr \
  --connection-param solr.collection=test \
  --zookeeper node-01:2181,node-03:2181,node-05:2181

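Notes:

--name: the name of the indexer; choose any name

--indexer-conf: the path of the morphline-hbase-mapper.xml file created and edited above

solr.zk: the ZooKeeper ensemble, including the /solr chroot, that SolrCloud uses

solr.collection: the collection created earlier (test in this walkthrough)

--zookeeper: the ZooKeeper nodes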

Deployment is now complete. Put a few rows into the table, wait a few seconds, and the corresponding data can be queried in Solr.
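The smoke test described above can be sketched as two small helpers: one prints the put statements, the other builds the Solr query URL. The row keys, values, and the host node-01 are illustrative assumptions, as are the helper names; the table, column, field, collection, and port 8983 follow this article.

```shell
#!/usr/bin/env bash
# Print HBase shell statements that write one test cell per row.
# Arguments are "rowkey=value" pairs; table and column follow this article.
sample_puts() {
  local pair
  for pair in "$@"; do
    printf "put 'table', '%s', 'test:test_id', '%s'\n" "${pair%%=*}" "${pair#*=}"
  done
}

# Build the Solr select URL for an exact match on one field.
# Values containing spaces or special characters would need URL encoding.
solr_select_url() {
  local host="$1" collection="$2" field="$3" value="$4"
  printf 'http://%s:8983/solr/%s/select?q=%s:%s&wt=json\n' \
    "$host" "$collection" "$field" "$value"
}

# Usage on a cluster node (not run here):
#   sample_puts row1=1001 row2=1002 | hbase shell -n
#   sleep 5
#   curl -s "$(solr_select_url node-01 test testId 1001)"
```

Each matching Solr document should carry the HBase row key (in the id field, assuming the Indexer's default unique-key setting), which the application can then pass to an HBase get to fetch the full row.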

Command summary

# list all indexers
hbase-indexer list-indexers
# delete the specified indexer
hbase-indexer delete-indexer --name XXX
# list all collections
solrctl collection --list
# delete a collection
solrctl collection --delete XXX
# list all instancedirs
solrctl instancedir --list
# delete an instancedir
solrctl instancedir --delete XXX
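To remove a test setup, the delete commands above can be printed as a plan and reviewed before running them. This is a dry-run sketch; teardown_plan is a hypothetical helper name. The order shown (indexer first, then collection, then instancedir) is an assumption chosen so that nothing is still writing to the collection when it is dropped.

```shell
#!/usr/bin/env bash
# Print, without running them, the commands that remove one indexer and
# its Solr collection and instancedir. teardown_plan is a hypothetical name.
# Depending on your setup, hbase-indexer may also need a --zookeeper argument.
teardown_plan() {
  local indexer="$1" collection="$2" instancedir="$3"
  echo "hbase-indexer delete-indexer --name $indexer"
  echo "solrctl collection --delete $collection"
  echo "solrctl instancedir --delete $instancedir"
}

# Names used in this article: indexer "indexer", collection and instancedir "test".
teardown_plan indexer test test
```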