매칭 점수

정규식 활용의 필수성
- web page
web dictionary
score

from collections import defaultdict
import re
from decimal import Decimal
def solution(word, pages) :
    word = word.lower()
    web = defaultdict(defaultdict)
    link_score = defaultdict(int)
    url_pattern = re.compile('<meta property=\"og:url\" content=\"(.+?)"')
    body_pattern = re.compile(r'\<body>\n(.+?)\n\</body>', re.S)
    ext_url_pattern = re.compile('<a href="(.+?)"')
    for index, page in enumerate(pages) :
        page = page.lower()
        url = url_pattern.findall(page).pop()
        web[url]['index'] = index
        body = ' '.join(body_pattern.findall(page))
        web[url]['ext_url'] = ext_url_pattern.findall(body)
        web[url]['basic_score'] = Decimal(str(re.sub('[^a-z]', '.', body).split('.').count(word)))
        for ext_url in web[url]['ext_url'] :
            link_score[ext_url] += web[url]['basic_score'] / len(web[url]['ext_url'])
    return sorted([[val['basic_score'] + link_score[url], val['index']] for url, val in web.items()], key = lambda x : (-x[0], x[1]))[0][1]

정규식 활용의 필수성

이거 정규식 아예 모르면 절대로 풀 수 없는 문제다. 정규식때문에 대가리 깨질뻔했지만, 검색해가며 어찌어찌 해결했다. 일단 웹페이지 구성을 알아야한다.

web page

<html lang="ko" xml:lang="ko" xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta charset="utf-8">
    <meta property="og:url" content="https://b.com"/>
  </head>  
  <body>
    Suspendisse potenti. Vivamus venenatis tellus non turpis bibendum, 
    <a href="https://a.com"> Link to a </a>
    blind sed congue urna varius. Suspendisse feugiat nisl ligula, quis malesuada felis hendrerit ut.
    <a href="https://c.com"> Link to c </a>
  </body>
</html>

page url이 나오는 위치 = <meta property="og:url" content="이곳"

words & 외부링크가 나오는 위치 = <body> 이곳 </body>

body태그 안에 외부링크가 나오는 위치 = <a href="이곳"

pages에서 우리에게 필요한 정보를 뽑기위해 정규식 패턴을 만들어야한다.

    url_pattern = re.compile('<meta property=\"og:url\" content=\"(.+?)"')
    body_pattern = re.compile(r'\<body>\n(.+?)\n\</body>', re.S)
    ext_url_pattern = re.compile('<a href="(.+?)"')

이 패턴이 이해되지 않으면 정규식 공부를 먼저 해야한다. 알아둬서 좋은수준이아니고 꼭 알아야 다음 코딩테스트 때 쉽게 문제를 풀 수 있을 것이다. 정규식을 이용한 문제는 자주 나온다.

이 패턴들을 이용해 re.findall로 필요한 문자열을 뽑아내야한다.

web dictionary

각 웹페이지의 정보들을 저장할 자료구조가 필요하다.

web = {'url1' : {'index' : 웹페이지 번호,
                 'ext_url' : 외부 링크,
                 'basic_score' : 기본점수},
       'url2' : {'index' : 웹페이지 번호,
                 'ext_url' : 외부 링크,
                 'basic_score' : 기본점수}}
link_score = {'url1' : link_score,
              'url2' : link_score}

web, ext_score의 자료구조 내부는 이 자료들을 저장한다.

score

구해야하는 매칭점수는 basic score, link score다. basic score는 각 웹페이지의 body문에서 구하면 되지만, link score는 다른 웹페이지로부터 받아야한다. 그렇게되면 반복문 2번을 돌려야하는데 차라리

각 웹페이지의 외부 링크의 link score를 구해서 외부 웹페이지의 link score에 더해준다.

이렇게하면 pages의 각 웹페이지를 한바퀴만 돌면 web, link_score가 구해진다.

from collections import defaultdict
import re
from decimal import Decimal
def solution(word, pages) :
    word = word.lower()
    web = defaultdict(defaultdict)
    link_score = defaultdict(int)
    url_pattern = re.compile('<meta property=\"og:url\" content=\"(.+?)"')
    body_pattern = re.compile(r'\<body>\n(.+?)\n\</body>', re.S)
    ext_url_pattern = re.compile('<a href="(.+?)"')
    for index, page in enumerate(pages) :
        page = page.lower()
        url = url_pattern.findall(page).pop() # url을 뽑아낸다.
        web[url]['index'] = index # url's index 
        body = ' '.join(body_pattern.findall(page)) # body문 뽑아낸다.
        web[url]['ext_url'] = ext_url_pattern.findall(body) # #body 문안의 모든 외부링크 뽑아낸다.
        web[url]['basic_score'] = Decimal(str(re.sub('[^a-z]', '.', body).split('.').count(word))) # basic score 
        for ext_url in web[url]['ext_url'] : #모든 외부 링크에대해 Link score를 구한 후 해당 웹페이지의 link score 갱신
            link_score[ext_url] += web[url]['basic_score'] / len(web[url]['ext_url'])
    '''
    복잡해 보이는 코드지만, 각 웹페이지의 [기본점수 + 링크점수] 리스트를 만들어 정렬한 후 가장큰 매칭점수를 뽑는 코드다.
    '''
    return sorted([[val['basic_score'] + link_score[url], val['index']] for url, val in web.items()], key = lambda x : (-x[0], x[1]))[0][1]

매칭 점수

정규식 활용의 필수성

web page

web dictionary

score

Similar Posts

Comments