겐지충 프로그래머 :: 자료구조

1. 트라이(Trie)란?

트라이(Trie)는 효율적인 문자열 저장 및 탐색이 가능한 자료구조의 형태이다.

우리가 수많은 문자열을 저장하였는데 입력으로 어떤 문자열이 주어졌을 때, 그것이 내가 저장한 구조에 포함된 문자열인지 어떻게 알 수 있을까? 아래와 같은 방식이 있을 것이다.

① 단순 비교

전체 문자열의 맨 처음부터 끝까지 개별 문자를 비교하는 방식
최대 문자열 길이 m, 전체 문자 개수 n개 시 시간복잡도 : 최악의 경우 O(nm)

② 이진 탐색 기법

전체 문자열을 사전 순으로 배열 저장 후, 중간 값과 비교해 사전적으로 이전이면 좌측의 반 / 이후면 우측의 반으로 비교하는 방식을 반복
최대 문자열 길이 m, 전체 문자 개수 n개 시 시간복잡도 : 정렬 O(nmlogn) / 탐색 O(mlogn) ▶ 즉, O(nmlogn)

③ 트라이(Trie)

K-진 트리 구조를 통해 문자열 저장하는 방식
영어사전의 방식을 그대로 차용함 - game이라면 단어를 찾는 다면 우리는 사전에서 g를 찾고 순서대로 a, m, e를 찾는데 이를 트리형식으로 구현한 것.
맨 앞의 접두어(prefix)부터 시작하여 단어 전체를 찾아가는 과정이므로 접두사 트리(Prefix-Tree) 라고도 한다.
트라이(Trie)는 문자열 저장을 위한 공간을 많이 쓰지만 탐색에 매우 효율적이라는 특성을 갖는다.
최대 문자열 길이 m 일 때 문자의 개수와 무관하게 시간복잡도는 O(m)
- 각 문자 하나를 배열의 위치로 지정하여 문자열 하나 찾을 때 O(1)이므로 길이만큼의 시간만 소요함
- 만약, 문자열 길이가 너무 너무 커서 Map 등을 이용해 동적할당을 해야 하는 경우 O(mlogn) 만큼 소요할 수 있다.

그림을 통해 어떻게 구현되는지 좀 더 직관적으로 이해해보자.

다음과 같은 문자 6개가 있는데, 이를 K-진 트리 형태 구조로 표현한다면 어떨까?
["bar", "bag", "bark", "dog", "do", "door"]

2. 트라이(Trie) 구현하기

트라이(Trie)를 구현하기 위한 코드를 하나씩 살펴보자. 트리(Tree) 구조의 각 노드(Node)에는 각 문자열의 일부를 포함하며 자신과 연결되는 문자열의 일부를 가지고 있어야 한다는 점을 명심하자.

여기서는 문자열이 전부 영어 소문자로만 이루어졌다고 판단하고 진행하자.

① 생성자 / 클래스 정의

루트 노드(Node)는 아무런 문자열(접두사)도 포함하지 않고 모든 문자열의 접두사들을 자식 배열로 갖고 있어야 한다.

아래의 코드를 참고하자. Trie 클래스의 생성자는 루트 노드(Node)의 내용을 정의한다. Node 클래스는 각 Node의 정보를 정의한 클래스로 자신의 문자, 연결되는 문자들의 정보를 갖고 있고 문자열이 완성되는가 여부 또한 저장하고 있다.

public class Trie {

    Node root;
    static final int ALPHABET_SIZE = 26; // a-z는 26개
    public Trie(){
        this.root = new Node();
        this.root.val = ' ';
    }
    
    private static class Node{
        Node[] child = new Node[ALPHABET_SIZE]; // 뒤로 연결되는 문자열 a-z 소문자를 index화하여 저장하는 배열(26개)
        boolean isTerminal = false;  // 현재 노드가 문자 완성이 되는 노드인지 여부
        int childNum = 0;            // 현재 노드에 연결된 문자열의 개수
        char val;                    // 현재 노드의 값
    }
}

② 문자열 삽입하기 (insert, charToInt)

아래의 코드를 보면 쉽게 이해할 수 있다.

charToInt 메소드는 각 문자열의 1개 단어를 int형 숫자로 변환해주는 메소드이다. (내부적으로만 사용하니 private형)
insert 메소드는 전체 문자열을 쪼개서 각 노드(Node)에 저장하는 메소드이다.

    private int charToInt(char c){
        return c - 'a'; // 여기서는 소문자만 있으므로 'a'를 빼면 된다.
    }

    public void insert(String str){
        int length = str.length();
        Node current = this.root;       // 루트 부터 시작해서 내려감
        for(int i=0; i < length; i++){
            char c = str.charAt(i);      // 전체 문자열의 일부 단어 추출
            int num = this.charToInt(c); // 추출한 단어를 숫자로 변환

            if(current.child[num] == null){     // 기존에 null이면 연결 문자열로 처음 추가되는 것
                current.child[num] = new Node();
                current.child[num].val = c;
                current.childNum++;
            }

            current = current.child[num];       // 자식 노드로 넘어감
        }
        current.isTerminal = true;
    }

③ 찾기 (find)

단어를 저장했으면 저장되어 있는지 확인할 수 있어야 한다. 이를 find 메소드를 통해 구현한다.

find 메소드 : 사용자가 호출 시 사용, 간편히 문자열만 전달

    // 반복문으로 노드를 순환하여 문자열 존재 여부 판단
    public boolean find(String str){
        int length = str.length();
        Node current = this.root; // 현재 노드 설정
        
        for(int i=0; i < length; i++){
        	char c = str.charAt(i);
            int num = this.charToInt(c);
            if(current.child[num] == null){ // 문자열의 일부를 추출했는데 null 이라면 false 반환
            	return false;
            }
            current = current.child[num];
        }
        return current != null && current.isTerminal; // 문자열의 마지막이라면 true
    }

④ 전체 저장된 문자열 내역 print 하기

모든 저장된 단어의 내역을 알기 쉽게 프린트할 수 있도록 하는 메소드를 구현해볼 수 있다. 아래 코드를 참조하자.

    private char intToChar(int i){
        return (char)(i + (int)'a');
    }

    public void printTrie(){ // 사용자가 간편히 호출만 하면 되는 메소드
        this.printTrie(this.root, 0, new StringBuilder());
    }

    // 내부에서 재귀적으로 순환하여 노드에 저장된 값들 추출해 프린트
    private void printTrie(Node node, int idx, StringBuilder sb){
        Node current = node;
        Node[] child = current.child;
        StringBuilder builder = new StringBuilder(sb);

        // 루트 노드에는 저장된 것이 없으므로 그 외의 경우에만 append
        if(!current.equals(this.root)){
            builder.append(intToChar(idx));
        }

        // 완성 노드라면 프린팅하면 된다.
        if(current.isTerminal){
            System.out.println(builder);
        }

        // 연결된 노드들을 순환하기 위해 반복문 사용
        for(int i=0; i < ALPHABET_SIZE; i++){
            // null 이라면 거기에는 저장된 것이 없다는 의미이므로 건너 뜀
            if(current.child[i] == null){
                continue;
            }
            printTrie(child[i], i, builder); // 재귀적으로 순환
        }
    }

⑤ 삭제하기 (delete)

삭제는 재귀적으로 bottom-up 방식으로 진행된다. Trie에서 삭제하는 로직은 다음과 같다.

1) 삭제할 문자가 다른 문자의 접두사인 경우 - isTerminal을 false 변경
- Do는 Door의 접두사가 된다. 따라서, Do를 삭제한다면 D , o 에서 o에 isTerminal만 false로 변경한다.
- 단순히 삭제하면 Door 또한 사라지게 된다.
2) 삭제할 문자가 Unique하여 다른 문자와 연관이 없는 경우 - 관련 모든 노드 삭제
3) 삭제할 문자의 일부가 전체 삭제 문자의 접두사인 경우 - 다른 문자에 영향가지 않는 곳까지만 삭제
- Door를 삭제하려고 한다면 Do가 있으니, 전체 삭제를 할 수 없고 Door에서 뒤의 o, r만 삭제해야 한다.

    // 사용자가 호출 시 사용하는 메소드
    public void delete(String str){
        delete(this.root, str, 0);
    }

    // 내부적으로 재귀를 통해 삭제를 진행하는 메소드
    private void delete(Node current, String str, int idx){
        int leng = str.length();

        // 자식이 없는데 string의 length의 끝까지 오지 않았다면 예외 발생
        // 끝까지 갔는데 해당 노드가 terminal가 아니었다면 해당 단어를 저장하지 않은 것이므로 예외 발생
        if((current.childNum == 0 && idx != leng) || (idx == leng && !current.isTerminal)) {
            throw new NoSuchElementException("Value " + str + " does not exist in Trie!");
        }

        // 문자열의 마지막에 다다른 경우
        if(idx == leng){
            current.isTerminal = false;

            // 자식이 없는데 문자의 마지막이었다면 해당 문자만 저장된 것이므로 null 처리
            if(current.childNum == 0){
                current = null;
            }
        } else {
            char c = str.charAt(idx);
            int num = charToInt(c);

            // 삭제 후 돌아오는 부분
            delete(current.child[num], str,idx+1);

            // child가 null처리 되었다면 자식 노드의 수가 하나 줄어든 것이므로 -- 처리
            if(current.child[num] == null) current.childNum--;

            // 현재 노드의 자식이 없고, 단어의 마지막도 아니라면 삭제해야 한다.
            if(current.childNum == 0 && !current.isTerminal){
                current = null;
            }
        }
    }

※ 참고로 아래에서는 배열을 이용해 구현하였으나, 이는 자식으로 들어올 종류를 알고 있을 경우에 해당할 것이다.(여기서는 영어 소문자만) 실제로는 각 Node를 HashMap 등을 이용해 구현하는 것이 더 일반적이다.

전체 코드는 아래와 같다.

import java.util.NoSuchElementException;

public class Trie {
    public static void main(String[] args){
        Trie trie = new Trie();
        trie.insert("bar");
        trie.insert("bag");
        trie.insert("bark");
        trie.insert("dog");
        trie.insert("do");
        trie.insert("door");

        System.out.println(trie.find("bag") ? "Yes!, bag is exist!" : "No, bag does not exist..");
        System.out.println(trie.find("baga") ? "Yes!, baga is exist!" : "No, baga does not exist..");
        System.out.println("Trie 내 문자열 전체 출력");
        trie.printTrie();

        System.out.println("Trie 내 일부 문자 삭제 테스트");
        try {
            trie.delete("do");
            trie.delete("doll");

        } catch (Exception e) {
            System.out.println(e.getMessage());
        }

        System.out.println("Trie 내 문자열 전체 출력");
        trie.printTrie();
    }

    Node root;
    static final int ALPHABET_SIZE = 26; // a-z는 26개
    public Trie(){
        this.root = new Node();
        this.root.val = ' ';
    }

    private static class Node{
        Node[] child = new Node[ALPHABET_SIZE]; // 뒤로 연결되는 문자열 a-z 소문자를 index화하여 저장하는 배열(26개)
        boolean isTerminal = false;  // 현재 노드가 문자 완성이 되는 노드인지 여부
        int childNum = 0;            // 현재 노드에 연결된 문자열의 개수
        char val;                    // 현재 노드의 값
    }

    private int charToInt(char c){
        return c - 'a'; // 여기서는 소문자만 있으므로 'a'를 빼면 된다.
    }

    public void insert(String str){
        int length = str.length();
        Node current = this.root;        // 루트 부터 시작해서 내려감
        for(int i=0; i < length; i++){
            char c = str.charAt(i);      // 전체 문자열의 일부 단어 추출
            int num = this.charToInt(c); // 추출한 단어를 숫자로 변환

            if(current.child[num] == null){     // 기존에 null이면 연결 문자열로 처음 추가되는 것
                current.child[num] = new Node();
                current.child[num].val = c;
                current.childNum++;
            }

            current = current.child[num];       // 자식 노드로 넘어감
        }
        current.isTerminal = true;
    }

    // 반복문으로 노드를 순환하여 문자열 존재 여부 판단
    public boolean find(String str){
        int length = str.length();
        Node current = this.root; // 현재 노드 설정

        for(int i=0; i < length; i++){
            char c = str.charAt(i);
            int num = this.charToInt(c);
            if(current.child[num] == null){ // 문자열의 일부를 추출했는데 null 이라면 false 반환
                return false;
            }
            current = current.child[num];
        }
        return current != null && current.isTerminal; // 문자열의 마지막이라면 true
    }

    private char intToChar(int i){
        return (char)(i + (int)'a');
    }

    public void printTrie(){ // 사용자가 간편히 호출만 하면 되는 메소드
        this.printTrie(this.root, 0, new StringBuilder());
    }

    // 내부에서 재귀적으로 순환하여 노드에 저장된 값들 추출해 프린트
    private void printTrie(Node node, int idx, StringBuilder sb){
        Node current = node;
        Node[] child = current.child;
        StringBuilder builder = new StringBuilder(sb);

        // 루트 노드에는 저장된 것이 없으므로 그 외의 경우에만 append
        if(!current.equals(this.root)){
            builder.append(intToChar(idx));
        }

        // 완성 노드라면 프린팅하면 된다.
        if(current.isTerminal){
            System.out.println(builder);
        }

        // 연결된 노드들을 순환하기 위해 반복문 사용
        for(int i=0; i < ALPHABET_SIZE; i++){
            // null 이라면 거기에는 저장된 것이 없다는 의미이므로 건너 뜀
            if(current.child[i] == null){
                continue;
            }
            printTrie(child[i], i, builder); // 재귀적으로 순환
        }
    }

    // 사용자가 호출 시 사용하는 메소드
    public void delete(String str){
        delete(this.root, str, 0);
    }

    // 내부적으로 재귀를 통해 삭제를 진행하는 메소드
    private void delete(Node current, String str, int idx){
        int leng = str.length();

        // 자식이 없는데 string의 length의 끝까지 오지 않았다면 예외 발생
        // 끝까지 갔는데 해당 노드가 terminal가 아니었다면 해당 단어를 저장하지 않은 것이므로 예외 발생
        if(current == null || (current.childNum == 0 && idx != leng) || (idx == leng && !current.isTerminal)) {
            throw new NoSuchElementException("Value " + str + " does not exist in Trie!");
        }

        // 문자열의 마지막에 다다른 경우
        if(idx == leng){
            current.isTerminal = false;

            // 자식이 없는데 문자의 마지막이었다면 해당 문자만 저장된 것이므로 null 처리
            if(current.childNum == 0){
                current = null;
            }
        } else {
            char c = str.charAt(idx);
            int num = charToInt(c);

            // 삭제 후 돌아오는 부분
            delete(current.child[num], str,idx+1);

            // child가 null처리 되었다면 자식 노드의 수가 하나 줄어든 것이므로 -- 처리
            if(current.child[num] == null) current.childNum--;

            // 현재 노드의 자식이 없고, 단어의 마지막도 아니라면 삭제해야 한다.
            if(current.childNum == 0 && !current.isTerminal){
                current = null;
            }
        }
    }
}

/*
아래는 결과 내용
Yes!, bag is exist!
No, baga does not exist..
Trie 내 문자열 전체 출력
bag
bar
bark
do
dog
door

Trie 내 일부 문자 삭제 테스트
Value doll does not exist in Trie!

// 아래 내용 중 삭제된 do는 없는 것을 확인할 수 있다.
Trie 내 문자열 전체 출력
bag
bar
bark
dog
door
*/

3. Longest prefix Matching

트라이(Trie)를 기반으로 Longest prefix Matching을 구현할 수도 있다. 트라이 내에 저장되어 있는 단어들 중 새로 전달하는 단어와 처음부터 가장 길게 매칭되는 단어를 찾는 것이다.

※ Longest Prefix Matching(LPM)이란?
네트워크의 라우터 상에서 IP Packet을 다음 Hop으로 전송하기 위해 사용하는 방식.
다음 Hop으로 전달 시에 현재 매칭 되는 IP 중에서 CIDR 기반 prefix가 가장 길게 매칭되는 것을 찾는 방식, 즉 IP가 앞에서부터 가장 길게 매칭되는 것을 찾는다.

예를 들어, 트라이에 {show, me, the, money} 로 4가지의 단어가 저장되어 있다고 가정하자. 이 때, showroom, meter 와 같은 단어가 전달된다면 LPM에 해당하는 단어는 어떤 것일까?

showroom의 LPM은 show가 되며, meter의 LPM은 me가 될 것이다. 이와 같이 저장된 단어들 중 가장 길게 매칭되는 단어를 찾는 과정을 구현해보자.

코드는 다음과 같다. getLPM 메소드 부분을 참고하자.

package com.test;

public class Trie {
    public static void main(String[] args){
        Trie trie = new Trie();
        trie.insert("main");
        trie.insert("battle");
        trie.insert("show");
        trie.insert("do");
        trie.insert("dog");
        trie.insert("door");

        String input = "showroom";
        System.out.println("[" + input + "] LPM - " + trie.getLPM(input));

        input = "battlefield";
        System.out.println("[" + input + "] LPM - " + trie.getLPM(input));

        input = "d";
        System.out.println("[" + input + "] LPM - " + trie.getLPM(input));
    }

    Node root;
    static final int ALPHABET_SIZE = 26; // a-z는 26개
    public Trie(){
        this.root = new Node();
        this.root.val = ' ';
    }

    private static class Node{
        Node[] child = new Node[ALPHABET_SIZE]; // 뒤로 연결되는 문자열 a-z 소문자를 index화하여 저장하는 배열(26개)
        boolean isTerminal = false;  // 현재 노드가 문자 완성이 되는 노드인지 여부
        int childNum = 0;            // 현재 노드에 연결된 문자열의 개수
        char val;                    // 현재 노드의 값
    }

    private int charToInt(char c){
        return c - 'a'; // 여기서는 소문자만 있으므로 'a'를 빼면 된다.
    }

    public void insert(String str){
        int length = str.length();
        Node current = this.root;       // 루트 부터 시작해서 내려감
        for(int i=0; i < length; i++){
            char c = str.charAt(i);      // 전체 문자열의 일부 단어 추출
            int num = this.charToInt(c); // 추출한 단어를 숫자로 변환

            if(current.child[num] == null){     // 기존에 null이면 연결 문자열로 처음 추가되는 것
                current.child[num] = new Node();
                current.child[num].val = c;
                current.childNum++;
            }

            current = current.child[num];       // 자식 노드로 넘어감
        }
        current.isTerminal = true;
    }

    // LPM을 찾는 메소드
    public String getLPM(String input){
        // 반환할 String 값을 지정
        String ret = "";
        int leng = input.length();

        Node current = this.root;

        int matchIdx = 0;
        for(int i=0; i < leng; i++){
            char c = input.charAt(i);
            int num = charToInt(c);

            // 자식이 null이 아니면 해당 자식노드로 이동
            if(current.child[num] != null){
                ret += c; // 해당 value를 반환 값에 추가
                current = current.child[num]; // 자식노드로 이동

                // 해당 자식 노드가 단어의 끝이라면 Trie에 저장된 것임.
                // i+1을 matchIdx에 저장하여 substring에 활용할 것으로 준비
                if(current.isTerminal){
                    matchIdx = i+1;
                }
            } else {
                break;
            }
        }
        // 반복문을 다 돌았는데 현재 위치가 단어의 마지막이 아니라면
        if(!current.isTerminal){
            // matchIdx로 substring을 해야 한다.
            return ret.substring(0, matchIdx);
        // 반복문을 다 돌았는데 마지막이라면 해당 단어가 LPM임    
        } else {
            return ret;
        }
    }
}

/*
결과
[showroom] LPM - show
[battlefield] LPM - battle
[d] LPM - 

Process finished with exit code 0
*/

4. 관련 문제

아래와 같이 지속 업데이트 후 추후 정답 코드 또한 올려두도록 하겠습니다.

백준 5670 : www.acmicpc.net/problem/5670

오류가 있는 부분은 댓글로 남겨주시면 반영하겠습니다. 감사합니다.

저작자표시 (새창열림)

'자바 프로그래밍 > 자료구조(Data Structure)' 카테고리의 다른 글

자료구조 - 정렬 (개요) (0)	2020.11.06
자료구조 - 세그먼트 트리(Segment Tree)2 - 느린 전파 (0)	2020.11.02
자료구조 - 그래프(Graph) (0)	2020.10.20
자료구조 - 우선순위 큐(Heap, Priority Queue) (2)	2020.09.30
자료구조 - 스택, 큐 직접 구현(Stack, Queue) (0)	2020.09.30

일	월	화	수	목	금	토
				1	2	3
4	5	6	7	8	9	10
11	12	13	14	15	16	17
18	19	20	21	22	23	24
25	26	27	28	29	30	31

겐지충 프로그래머

자료구조 - 트라이(Trie)

1. 트라이(Trie)란?

2. 트라이(Trie) 구현하기

3. Longest prefix Matching

4. 관련 문제

'자바 프로그래밍 > 자료구조(Data Structure)' 카테고리의 다른 글

티스토리툴바

1. 트라이(Trie)란?

2. 트라이(Trie) 구현하기

3. Longest prefix Matching

4. 관련 문제

'자바 프로그래밍 > 자료구조(Data Structure)' 카테고리의 다른 글

검색

티스토리툴바