Re: Search scores in IN NATURAL LANGUAGE MODE (groonga-dev,02812) - Groonga - fulltext search engine.

Hello.

On 25/09/2014 15:26, Kouhei Sutou wrote:
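> Also, this is an operator, not a pragma.
> You use it as one of the conditions, like this:
> 
>   a *S"b c" d
> 
>> Just to confirm: it can also be used without a double-quoted phrase
>> search, as in AGAINST('*S1 a b c' IN BOOLEAN MODE), right?
> 
> No, it cannot. The whole of *S1"..." is a single unit.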

I tried it out.
This is with groonga rev. b28b35654bb4d24092d55caf25e4f299f856aeea.

DROP TABLE IF EXISTS `diaries`;
/*!40101 SET @saved_cs_client     = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `diaries` (
  `id` int(10) unsigned NOT NULL,
  `content` text COLLATE utf8_unicode_ci,
  PRIMARY KEY (`id`),
  FULLTEXT KEY `content` (`content`) COMMENT 'parser "TokenBigram"'
) ENGINE=Mroonga DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
/*!40101 SET character_set_client = @saved_cs_client */;
INSERT INTO `diaries` VALUES (1,'It\'ll be fine tomorrow as well.');
INSERT INTO `diaries` VALUES (2,'It\'ll rain tomorrow.');
INSERT INTO `diaries` VALUES (3,'It\'s fine today. It\'ll be fine tomorrow as well.');
INSERT INTO `diaries` VALUES (4,'It\'s fine today. But it\'ll rain tomorrow.');
INSERT INTO `diaries` VALUES (5,'Ring the bell.');
INSERT INTO `diaries` VALUES (6,'I love dumbbells.');

SELECT *, MATCH (content) AGAINST ('*S2"fine tomorrow"') AS score FROM
diaries;

+----+--------------------------------------------------+--------+
| id | content                                          | score  |
+----+--------------------------------------------------+--------+
|  1 | It'll be fine tomorrow as well.                  |      0 |
|  2 | It'll rain tomorrow.                             |      0 |
|  3 | It's fine today. It'll be fine tomorrow as well. | 174763 |
|  4 | It's fine today. But it'll rain tomorrow.        | 174763 |
|  5 | Ring the bell.                                   |      0 |
|  6 | I love dumbbells.                                |      0 |
+----+--------------------------------------------------+--------+

2014-09-25 16:21:58.340539|i|b5453700|grn_ii_sel > (*S2"fine tomorrow")
2014-09-25 16:21:58.340783|i|b5453700|exact: 2
2014-09-25 16:21:58.340799|i|b5453700|hits=2

Huh?

I wondered whether I might need to specify boolean mode even though I
want to search in natural language mode, so I tried that.

SELECT *, MATCH (content) AGAINST ('*S2"fine tomorrow"' in boolean mode)
AS score FROM diaries;
+----+--------------------------------------------------+--------+
| id | content                                          | score  |
+----+--------------------------------------------------+--------+
|  1 | It'll be fine tomorrow as well.                  | 211835 |
|  2 | It'll rain tomorrow.                             |  95326 |
|  3 | It's fine today. It'll be fine tomorrow as well. | 328344 |
|  4 | It's fine today. But it'll rain tomorrow.        | 211835 |
|  5 | Ring the bell.                                   |      0 |
|  6 | I love dumbbells.                                |      0 |
+----+--------------------------------------------------+--------+

2014-09-25 16:23:42.332156|i|b5453700|grn_ii_sel > (fine tomorrow)
2014-09-25 16:23:42.354115|i|b5453700|exact: 4
2014-09-25 16:23:42.354165|i|b5453700|hits=4

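Oh, these are the results I was looking for.

However, I find it far too confusing a specification that the *S1"..."
operator can be used only in boolean mode while the score it produces is
the natural language mode equivalent. On top of that, for the parts of
the query without the operator the score is just the number of
occurrences, so the difference between the two is too large.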

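>> This seems like a rather daring change to make on the 25th. Will it be
>> included in the release on the 29th of this month and become the
>> official specification?
> 
> Actually, this syntax was available in Senna and stopped working in
> Groonga, so it is not that daring a change.
> 
>   http://qwik.jp/senna/query.html
> 
>> Personally, my goal is for the default heuristic value to become
>> "nicer" than it is now, but there does not seem to be enough time for
>> that.
> 
> It will be cutting it close, I suppose. :-)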

The change itself is simple:

diff --git a/lib/ii.c b/lib/ii.c
index d10b84d..e20bdce 100644
--- a/lib/ii.c
+++ b/lib/ii.c
@@ -5695,7 +5695,9 @@ grn_ii_similar_search(grn_ctx *ctx, grn_ii *ii,
     ? (optarg->similarity_threshold > GRN_HASH_SIZE(h)
        ? GRN_HASH_SIZE(h)
        : optarg->similarity_threshold)
-    : (GRN_HASH_SIZE(h) >> 3) + 1;
+    : (GRN_HASH_SIZE(h) < 8
+       ? GRN_HASH_SIZE(h)
+       : ((GRN_HASH_SIZE(h) - 8) >> 3) + 8);
   if (GRN_HASH_SIZE(h)) {
     grn_id j, id;
     int w2, rep;

With this change applied, I ran:

SELECT *, MATCH (content) AGAINST ('fine tomorrow') AS score FROM diaries;

+----+--------------------------------------------------+--------+
| id | content                                          | score  |
+----+--------------------------------------------------+--------+
|  1 | It'll be fine tomorrow as well.                  | 211835 |
|  2 | It'll rain tomorrow.                             |  95326 |
|  3 | It's fine today. It'll be fine tomorrow as well. | 328344 |
|  4 | It's fine today. But it'll rain tomorrow.        | 211835 |
|  5 | Ring the bell.                                   |      0 |
|  6 | I love dumbbells.                                |      0 |
+----+--------------------------------------------------+--------+

2014-09-25 16:39:38.708716|i|fb44d700|grn_ii_sel > (fine tomorrow)
2014-09-25 16:39:38.709891|i|fb44d700|exact: 4
2014-09-25 16:39:38.709918|i|fb44d700|hits=4

Those are exactly the results I wanted, which made me very happy.

With large data sets, would evaluating 8 tokens be too many and make
searches too slow?

Kazuhiko


