Synonym registration test (nori / Elasticsearch v6.5.4)

듐듐다다 2021. 8. 26. 14:22

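When the synonym dictionary is applied

Once the tokenizer has finished morphological analysis, the token filters run in the order in which they are defined, and the synonym dictionary is applied at the synonym filter's place in that order.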
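
The effect of filter order can be illustrated with a small toy model. This is only a sketch: a whitespace split stands in for nori_tokenizer, and the helper names (lowercase, make_synonym_filter, analyze) are made up for illustration and are not Elasticsearch APIs.

```python
from typing import Callable, Dict, List

TokenFilter = Callable[[List[str]], List[str]]


def lowercase(tokens: List[str]) -> List[str]:
    """Stand-in for the built-in lowercase token filter."""
    return [t.lower() for t in tokens]


def make_synonym_filter(rules: Dict[str, List[str]]) -> TokenFilter:
    """Return a filter that replaces each token found in `rules` with its synonyms."""
    def apply(tokens: List[str]) -> List[str]:
        out: List[str] = []
        for token in tokens:
            out.extend(rules.get(token, [token]))
        return out
    return apply


def analyze(text: str, filters: List[TokenFilter]) -> List[str]:
    """Tokenize first, then run every filter in the listed order."""
    tokens = text.split()  # assumption: whitespace split instead of nori_tokenizer
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


synonym = make_synonym_filter({"korea": ["한국", "코리아", "korea"]})

# lowercase runs before synonym: "KOREA" becomes "korea", so the rule fires.
lower_first = analyze("KOREA 여행", [lowercase, synonym])
# synonym runs before lowercase: the rule sees "KOREA" and does not fire.
synonym_first = analyze("KOREA 여행", [synonym, lowercase])
print(lower_first)
print(synonym_first)
```

In the index settings used later in this post, the filter list is lowercase, synonym, stop_filter, so the synonym filter sees lowercased tokens.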

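Registering synonyms

ELS_HOME/config/synonyms.txt
Enter one synonym rule per line, and save the file in UTF-8 encoding. The synonyms_path setting is resolved relative to the config directory, so the file must sit at the path configured on the index (the index settings in this test use dic/synonyms.txt).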

에어프라이어,에어플라이어,애어프라이어,애어플라이어
에이아이,인공지능
대한민국,우리나라,한국
한국,코리아,korea
아름다움,멋=>뿜뿜
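
The two rule forms in this file can be read as follows: a comma-separated line makes every word expand to the whole group, while a line with => replaces the words on the left with the words on the right. The parser below is a simplified sketch of that reading. It works on whole words and ignores the fact that Elasticsearch also runs the rule text through the analyzer; parse_synonym_line and load_rules are hypothetical helper names.

```python
from typing import Dict, List


def parse_synonym_line(line: str) -> Dict[str, List[str]]:
    """Parse one rule line into {input word: [output words]}."""
    line = line.strip()
    if not line or line.startswith("#"):  # blank lines and comments carry no rule
        return {}
    if "=>" in line:
        left, right = line.split("=>", 1)
        inputs = [w.strip() for w in left.split(",") if w.strip()]
        outputs = [w.strip() for w in right.split(",") if w.strip()]
        return {word: list(outputs) for word in inputs}
    words = [w.strip() for w in line.split(",") if w.strip()]
    # Equivalent synonyms: each word expands to the full group.
    return {word: list(words) for word in words}


def load_rules(text: str) -> Dict[str, List[str]]:
    """Merge all lines; a word that appears in several rules collects every output."""
    rules: Dict[str, List[str]] = {}
    for line in text.splitlines():
        for word, outputs in parse_synonym_line(line).items():
            merged = rules.setdefault(word, [])
            for out in outputs:
                if out not in merged:
                    merged.append(out)
    return rules


SYNONYMS_TXT = """\
대한민국,우리나라,한국
한국,코리아,korea
아름다움,멋=>뿜뿜
"""

rules = load_rules(SYNONYMS_TXT)
print(rules["한국"])
print(rules["멋"])
```

Note that 한국 appears in two rules, so under this reading it picks up the outputs of both, whereas 뿜뿜 has no entry of its own because it only appears on the right-hand side.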

After registering synonyms, close and then reopen the index so that its analysis settings (_settings) are reloaded with the updated file.

POST synonyms_dic_test/_close
POST synonyms_dic_test/_open

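To apply the dictionary to documents that are already indexed, they must be reindexed, because the synonyms are expanded when a document is analyzed at index time.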
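
The close, open, and reindex steps could be scripted roughly as below. This is a sketch under assumptions: the post does not say how the reindex was done, so _update_by_query (which rewrites every document from its _source) is used here as one possible way to re-analyze documents in place; the base URL and the helper names are hypothetical.

```python
import json
import urllib.request
from typing import List, Tuple


def reload_steps(index: str) -> List[Tuple[str, str]]:
    """Return the (method, path) pairs to run, in order."""
    return [
        ("POST", f"/{index}/_close"),  # release the analyzers built from the old file
        ("POST", f"/{index}/_open"),   # rebuild the analyzers from the current file
        # Assumption: re-analyze existing documents in place from their _source.
        ("POST", f"/{index}/_update_by_query?conflicts=proceed"),
    ]


def send(base_url: str, method: str, path: str) -> dict:
    """Send one request without a body and return the parsed JSON response."""
    request = urllib.request.Request(base_url + path, method=method)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def reload_synonyms(base_url: str, index: str) -> None:
    """Run all steps against a live cluster, e.g. base_url='http://localhost:9200'."""
    for method, path in reload_steps(index):
        print(method, path, send(base_url, method, path))


# Only print the plan here; call reload_synonyms() to actually run it.
for method, path in reload_steps("synonyms_dic_test"):
    print(method, path)
```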

Testing synonym application

Delete the existing index

DELETE synonyms_dic_test

Create the index

PUT synonyms_dic_test/
{
  "mappings": {
    "product": {
      "properties": {
        
        "product_sub_title": {
          "type": "text",
          "analyzer": "korean"
        },
        "product_title": {
          "type": "text",
          "analyzer": "korean"
        }
      }
    }
  },
  "settings": {
    "index": {
      "number_of_shards": "3",
      "refresh_interval" : "-1",
      "analysis": {
        "filter": {
          "stop_filter": {
            "type": "stop",
            "stopwords_path": "stopwords.txt"
          },
          "synonym": {
            "type": "synonym",
            "synonyms_path": "dic/synonyms.txt",
            "updateable": "true"
          }
        },
        "char_filter": {
          "decimal_mark_filter": {
            "pattern": "(\\d+),(?=\\d)",
            "type": "pattern_replace",
            "replacement": "$1 "
          }
        },
        "analyzer": {
          "korean": {
            "filter": [
              "lowercase",
              "synonym",
              "stop_filter"
            ],
            "type": "custom",
            "tokenizer": "korean_default_tokenizer"
          },
          "category_seq_analyzer": {
            "char_filter": [
              "decimal_mark_filter"
            ],
            "tokenizer": "standard"
          }
        },
        "tokenizer": {
          "korean_default_tokenizer": {
            "type": "nori_tokenizer",
            "user_dictionary": "userdict_ko.txt",
            "decompound_mode": "discard"
          }
        }
      },
      "number_of_replicas": "1"
    }
  }
}

Insert test data

PUT synonyms_dic_test/product/1
{
  "product_title":"대한민국에 오신것을 환영합니다."
}
PUT synonyms_dic_test/product/2
{
  "product_title":"우리나라는 강산이 어우러져 자연의 아름다움을 느낄 수 있습니다."
}
PUT synonyms_dic_test/product/3
{
  "product_title":"한국의 멋"
}
PUT synonyms_dic_test/product/4
{
  "product_title":"korea는 내나라"
}
PUT synonyms_dic_test/product/5
{
  "product_title":"우리나라의 자랑"
}
PUT synonyms_dic_test/product/6
{
  "product_title":"먼지가 뿜뿜"
}

Search test 1

Dictionary entries

대한민국,우리나라,한국
한국,코리아,korea

Query: '우리나라'
Result: documents containing any of the registered keywords (대한민국, 우리나라, 한국, 코리아, korea) are returned.

// Search
GET synonyms_dic_test/product/_search
{
  "query": {
    "match": {
      "product_title":{
        "query":"우리나라",
        "operator": "and"
      }
    }
  }
}

// Result
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 5,
    "max_score" : 6.9298778,
    "hits" : [
      {
        "_index" : "synonyms_dic_test",
        "_type" : "product",
        "_id" : "5",
        "_score" : 6.9298778,
        "_source" : {
          "product_title" : "우리나라의 자랑"
        }
      },
      {
        "_index" : "synonyms_dic_test",
        "_type" : "product",
        "_id" : "2",
        "_score" : 4.0648756,
        "_source" : {
          "product_title" : "우리나라는 강산이 어우러져 자연의 아름다움을 느낄 수 있습니다."
        }
      },
      {
        "_index" : "synonyms_dic_test",
        "_type" : "product",
        "_id" : "3",
        "_score" : 2.2562294,
        "_source" : {
          "product_title" : "한국의 멋"
        }
      },
      {
        "_index" : "synonyms_dic_test",
        "_type" : "product",
        "_id" : "1",
        "_score" : 1.6988081,
        "_source" : {
          "product_title" : "대한민국에 오신것을 환영합니다."
        }
      },
      {
        "_index" : "synonyms_dic_test",
        "_type" : "product",
        "_id" : "4",
        "_score" : 0.44839138,
        "_source" : {
          "product_title" : "korea는 내나라"
        }
      }
    ]
  }
}

Search test 2

Dictionary entries

아름다움,멋=>뿜뿜

Query: '아름다움'
Result: following the registered rule (아름다움,멋=>뿜뿜), a search for '아름다움' or '멋' returns documents containing any of '아름다움', '멋', or '뿜뿜'.

// Search
GET synonyms_dic_test/product/_search
{
  "query": {
    "match": {
      "product_title":{
        "query":"아름다움",
        "operator": "and"
      }
    }
  }
}

// Result
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 2.5682926,
    "hits" : [
      {
        "_index" : "synonyms_dic_test",
        "_type" : "product",
        "_id" : "6",
        "_score" : 2.5682926,
        "_source" : {
          "product_title" : "먼지가 뿜뿜"
        }
      },
      {
        "_index" : "synonyms_dic_test",
        "_type" : "product",
        "_id" : "3",
        "_score" : 2.3460367,
        "_source" : {
          "product_title" : "한국의 멋"
        }
      },
      {
        "_index" : "synonyms_dic_test",
        "_type" : "product",
        "_id" : "2",
        "_score" : 1.5926098,
        "_source" : {
          "product_title" : "우리나라는 강산이 어우러져 자연의 아름다움을 느낄 수 있습니다."
        }
      }
    ]
  }
}
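
One way to see why all three documents come back: the index mapping sets only one analyzer on product_title, so the same rule rewrites both the indexed text and the query. The sketch below models that at the whole-word level. The token lists per document are hand-written assumptions, not real nori output, and normalize and search are hypothetical helpers.

```python
from typing import Dict, List, Set

# The rule 아름다움,멋=>뿜뿜 as a word-level rewrite table.
REWRITE: Dict[str, str] = {"아름다움": "뿜뿜", "멋": "뿜뿜"}

# Assumption: simplified content words per test document.
DOCS: Dict[str, List[str]] = {
    "2": ["우리나라", "강산", "자연", "아름다움"],
    "3": ["한국", "멋"],
    "5": ["우리나라", "자랑"],
    "6": ["먼지", "뿜뿜"],
}


def normalize(tokens: List[str]) -> Set[str]:
    """Apply the rewrite rule, as the synonym filter would at index and search time."""
    return {REWRITE.get(token, token) for token in tokens}


def search(query_tokens: List[str]) -> List[str]:
    """Return ids of documents whose rewritten tokens contain every rewritten query token."""
    wanted = normalize(query_tokens)
    return sorted(doc_id for doc_id, tokens in DOCS.items() if wanted <= normalize(tokens))


print(search(["아름다움"]))
print(search(["멋"]))
```

Under this model 아름다움, 멋, and 뿜뿜 all end up as the same indexed term, which matches the three hits in the result above.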

Search test 3
Check what happens when the rule form 'B,C,D=>A' is reversed to 'A=>B,C,D'.
(Does a search for A return all of the documents containing B, C, and D?)

Dictionary rule	Query	Result
뿜뿜=>아름다움,멋,자랑	뿜뿜	_id:6 (먼지가 뿜뿜), _id:2 (우리나라는 강산이 어우러져 자연의 아름다움을 느낄 수 있습니다.)
뿜뿜=>아름다움,멋,자랑	멋	_id:6, _id:3 (한국의 멋)
뿜뿜=>아름다움,멋,자랑	아름다움	_id:6, _id:2
뿜뿜=>아름다움,멋,자랑	자랑	_id:6, _id:5 (우리나라의 자랑)
뿜뿜=>멋,자랑,아름다움	뿜뿜	_id:6, _id:2
뿜뿜=>자랑,아름다움,멋	뿜뿜	_id:6, _id:2
자랑=>뿜뿜,아름다움,멋	자랑	_id:6, _id:2, _id:5
자랑=>아름다움,멋,뿜뿜	자랑	_id:6, _id:2, _id:5
자랑=>멋,아름다움,뿜뿜	자랑	_id:6, _id:2, _id:5
아름다움=>멋,자랑,뿜뿜	아름다움	_id:6, _id:2
아름다움=>자랑,뿜뿜,멋	아름다움	_id:6, _id:2

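Conclusion: when a rule is registered as A=>B,C,D, a search for A seems to match only one or two of B, C, and D arbitrarily, and I could not work out the rule.
Yet when the keyword in A's position is checked with _termvectors, it carries all of the B, C, and D keywords.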
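
One possible explanation, offered as a hypothesis and not as a confirmed account of Elasticsearch internals: nori splits some of these words into more than one token (the term vectors in this post show 아름답 and ᄆ for 아름다움, and 뿜 twice for 뿜뿜), the synonym filter stacks the outputs position by position, and a match query with operator and then requires a hit at every position. The sketch below encodes that model. The ANALYZED table and the helper names are assumptions made for illustration.

```python
from typing import Dict, List, Set

# Assumed analyzer output per word, inferred from the term vectors in this post.
ANALYZED: Dict[str, List[str]] = {
    "아름다움": ["아름답", "ᄆ"],
    "뿜뿜": ["뿜", "뿜"],
    "멋": ["멋"],
    "자랑": ["자랑"],
}

# The keyword each test document contributes (other words are ignored here).
DOC_WORDS: Dict[str, str] = {"2": "아름다움", "3": "멋", "5": "자랑", "6": "뿜뿜"}


def expand(word: str, rule_input: str, rule_outputs: List[str]) -> List[Set[str]]:
    """Return one set of terms per position for `word` under rule_input=>rule_outputs."""
    if word != rule_input:
        return [{term} for term in ANALYZED[word]]
    width = max(len(ANALYZED[out]) for out in rule_outputs)
    positions: List[Set[str]] = [set() for _ in range(width)]
    for out in rule_outputs:
        for offset, term in enumerate(ANALYZED[out]):
            positions[offset].add(term)  # outputs are stacked at the same positions
    return positions


def search(query: str, rule_input: str, rule_outputs: List[str]) -> List[str]:
    """operator=and: every query position must match at least one term in the document."""
    query_positions = expand(query, rule_input, rule_outputs)
    hits = []
    for doc_id, word in DOC_WORDS.items():
        doc_terms = set().union(*expand(word, rule_input, rule_outputs))
        if all(group & doc_terms for group in query_positions):
            hits.append(doc_id)
    return sorted(hits)


print(search("자랑", "자랑", ["멋", "아름다움", "뿜뿜"]))
print(search("뿜뿜", "뿜뿜", ["아름다움", "멋", "자랑"]))
print(search("멋", "뿜뿜", ["아름다움", "멋", "자랑"]))
```

Under these assumptions the model gives the same document ids as the rows of the table above that it is run against. The document containing only 멋 drops out whenever the expanded query has a second position, because that document has no term to satisfy it. The order of the words on the right-hand side does not change the outcome, which is consistent with the table.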

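/*
After registering '자랑=>멋,아름다움,뿜뿜' in the dictionary,
checking _id:5, which contains the phrase '우리나라의 자랑', with _termvectors
shows that '자랑' has been converted to and stored as '뿜', '아름답', and '멋'.
As a result, searching for '뿜뿜', '아름다움', or '멋' also returns the document containing '자랑',
but when searching for '자랑' it is not possible to predict which of the B, C, D keywords the returned documents will contain.
*/
GET synonyms_dic_test/product/5/_termvectors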
{
  "fields" : ["product_title"],
  "offsets" : true,
  "payloads" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true
}

===
{
  "_index" : "synonyms_dic_test",
  "_type" : "product",
  "_id" : "5",
  "_version" : 1,
  "found" : true,
  "took" : 0,
  "term_vectors" : {
    "product_title" : {
      "field_statistics" : {
        "sum_doc_freq" : 40,
        "doc_count" : 4,
        "sum_ttf" : 42
      },
      "terms" : {
        "ᄆ" : {
          "doc_freq" : 2,
          "ttf" : 2,
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 4,
              "start_offset" : 6,
              "end_offset" : 8
            }
          ]
        },
        "나라" : {
          "doc_freq" : 3,
          "ttf" : 3,
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 1,
              "start_offset" : 2,
              "end_offset" : 4
            }
          ]
        },
        "대한" : {
          "doc_freq" : 2,
          "ttf" : 2,
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 0,
              "end_offset" : 2
            }
          ]
        },
        "멋" : {
          "doc_freq" : 1,
          "ttf" : 1,
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 3,
              "start_offset" : 6,
              "end_offset" : 8
            }
          ]
        },
        "민국" : {
          "doc_freq" : 2,
          "ttf" : 2,
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 1,
              "start_offset" : 2,
              "end_offset" : 4
            }
          ]
        },
        "뿜" : {
          "doc_freq" : 2,
          "ttf" : 4,
          "term_freq" : 2,
          "tokens" : [
            {
              "position" : 3,
              "start_offset" : 6,
              "end_offset" : 8
            },
            {
              "position" : 4,
              "start_offset" : 6,
              "end_offset" : 8
            }
          ]
        },
        "아름답" : {
          "doc_freq" : 2,
          "ttf" : 2,
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 3,
              "start_offset" : 6,
              "end_offset" : 8
            }
          ]
        },
        "우리" : {
          "doc_freq" : 2,
          "ttf" : 2,
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 0,
              "end_offset" : 2
            }
          ]
        },
        "의" : {
          "doc_freq" : 3,
          "ttf" : 3,
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 2,
              "start_offset" : 4,
              "end_offset" : 5
            }
          ]
        },
        "한국" : {
          "doc_freq" : 3,
          "ttf" : 3,
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 0,
              "end_offset" : 4
            }
          ]
        }
      }
    }
  }
}
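
A response this long is easier to read when the terms are grouped by token position. The helper below is a small sketch that does this for a parsed _termvectors response; terms_by_position is a hypothetical name, and the sample dictionary is a trimmed copy of a few terms from the output above.

```python
from collections import defaultdict
from typing import Dict, List


def terms_by_position(response: dict, field: str) -> Dict[int, List[str]]:
    """Group the terms of one field by the token positions they occupy."""
    grouped: Dict[int, List[str]] = defaultdict(list)
    terms = response["term_vectors"][field]["terms"]
    for term, info in terms.items():
        for token in info.get("tokens", []):
            grouped[token["position"]].append(term)
    return {position: sorted(grouped[position]) for position in sorted(grouped)}


# Trimmed sample: only the terms at positions 2 to 4 of the response above.
sample = {
    "term_vectors": {
        "product_title": {
            "terms": {
                "의": {"tokens": [{"position": 2}]},
                "멋": {"tokens": [{"position": 3}]},
                "아름답": {"tokens": [{"position": 3}]},
                "뿜": {"tokens": [{"position": 3}, {"position": 4}]},
                "ᄆ": {"tokens": [{"position": 4}]},
            }
        }
    }
}

positions = terms_by_position(sample, "product_title")
for position, terms in positions.items():
    print(position, terms)
```

Grouped this way, the output shows that position 3 (where '자랑' originally stood) holds three stacked terms, and that the multi-token synonyms spill over into position 4.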

Reference links

* The same, but different: Boosting the power of Elasticsearch with synonyms:
https://www.elastic.co/kr/blog/boosting-the-power-of-elasticsearch-with-synonyms

* Synonyms:
https://esbook.kimjmin.net/06-text-analysis/6.6-token-filter/6.6.3-synonym