BeautifulSoup parsers: the source code parsers

finterstellar 2019. 10. 4. 00:53

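When you use BeautifulSoup for web crawling, that is, for collecting data from external sites, you sometimes have to decide which parser (the component that interprets the page source) to use.

Let's take this chance to sort it out once and for all.

The table below comes from the official BeautifulSoup documentation.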

Parser	Typical usage	Advantages	Disadvantages
Python’s html.parser	`BeautifulSoup(markup, "html.parser")`	Batteries included; decent speed; lenient (as of Python 2.7.3 and 3.2)	Not as fast as lxml; less lenient than html5lib
lxml’s HTML parser	`BeautifulSoup(markup, "lxml")`	Very fast; lenient	External C dependency
lxml’s XML parser	`BeautifulSoup(markup, "lxml-xml")` or `BeautifulSoup(markup, "xml")`	Very fast; the only currently supported XML parser	External C dependency
html5lib	`BeautifulSoup(markup, "html5lib")`	Extremely lenient; parses pages the same way a web browser does; creates valid HTML5	Very slow; external Python dependency

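Put in plainer terms:

Parser	Example usage	Pros	Cons
html.parser (built into Python)	`BeautifulSoup(markup, "html.parser")`	Ships with Python, so nothing extra to install; reasonable speed; tolerant of malformed markup (on Python 2.7.3 / 3.2 and later)	Slower than lxml; less tolerant than html5lib
HTML parser (lxml)	`BeautifulSoup(markup, "lxml")`	Very fast; tolerant of malformed markup	Depends on an external C library
XML parser (lxml)	`BeautifulSoup(markup, "lxml-xml")` or `BeautifulSoup(markup, "xml")`	Very fast; the only XML parser currently supported	Depends on an external C library
html5lib	`BeautifulSoup(markup, "html5lib")`	Extremely tolerant; reads the source the same way a web browser does; produces valid HTML5	Very slow; the html5lib package must be installed before it can be used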
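To see what "lenient" means in practice, it can help to feed the same broken markup to each parser and compare the results. The following is only a minimal sketch: the helper name `parse_with_each` and the sample string are made up for illustration, and it assumes `bs4` is installed. lxml and html5lib are optional, so any parser that is missing is reported as `None` instead of raising an error.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Deliberately broken markup: an unclosed <a> followed by a stray </p>.
BROKEN_MARKUP = "<a></p>"


def parse_with_each(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser name: parsed document as a string, or None if not installed}.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            # bs4 raises FeatureNotFound when the requested parser library
            # (lxml or html5lib) is not installed.
            results[name] = None
    return results


if __name__ == "__main__":
    for name, html in parse_with_each(BROKEN_MARKUP).items():
        shown = html if html is not None else "(not installed)"
        print(f"{name:12} -> {shown}")
```

With html.parser the stray `</p>` is simply dropped and the result is `<a></a>`. According to the BeautifulSoup documentation, lxml additionally wraps the result in `<html><body>`, and html5lib builds a full document including `<head>` and keeps a `<p>` element, so the same input can produce three different trees.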

To sum up:

use lxml whenever you can,
and if it really cannot handle a page, install html5lib and use that instead.
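One way to read this advice is as a preference order. The sketch below tries lxml first, then html5lib, then the built-in html.parser, and uses the first one that is installed. The function name `make_soup` and the tuple `PREFERRED_PARSERS` are hypothetical names chosen for this example. Note the assumption: it only falls back when a parser is missing, not when a parser produces a poor result. Deciding that lxml "cannot handle" a particular page is still a manual judgment.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Preference order taken from the summary above; html.parser is added
# last as an assumption, because it is always available with Python.
PREFERRED_PARSERS = ("lxml", "html5lib", "html.parser")


def make_soup(markup, parsers=PREFERRED_PARSERS):
    """Parse markup with the first installed parser.

    Returns a (soup, parser_name) tuple. Hypothetical helper for
    illustration; not part of BeautifulSoup.
    """
    for name in parsers:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # This parser library is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is installed")


if __name__ == "__main__":
    soup, used = make_soup("<p>hello</p>")
    print(used, soup.p.get_text())
```

Returning the parser name alongside the soup makes it easy to log which parser was actually used, which matters because different parsers can build different trees from the same page.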
