还有一个问题是现在的页面渲染很多都使用js,这样的话在后端并没有办法处理,经过调查,发现了chrome headless
模式。
所谓chrome headless
就是让chrome以无头的模式在后台运行,然后我们可以借用他的api打印pdf(不仅仅是打印pdf,还可以截图,还有很多别的功能我了解的也不是很透彻)
开胃菜
如果你仅仅是想测试一下这个功能,可以通过命令行执行chrome文件,如下
chrome \ # 注意chrome是你的可执行软件,不同操作系统路径不同 --headless \ # Runs Chrome in headless mode. --disable-gpu \ # Temporarily needed if running on Windows. --remote-debugging-port=9222 \ https://www.chromestatus.com # URL to open. Defaults to about:blank.
这样就会把目标网站打印成pdf了,没测过,不知道行不行。
正餐
安装与启动
在 Chrome 安装完毕后我们可以利用其包体内自带的命令行工具启动:
$ chrome --headless --remote-debugging-port=9222 https://chromium.org
笔者为了部署方便,使用 Docker 镜像来进行快速部署,如果你本地存在 Docker 环境,可以使用如下命令快速启动:
docker run -d -p 9222:9222 justinribeiro/chrome-headless
Dockerfile
# 在容器运行一个 Chrome Headless # # What was once a container using the experimental build of headless_shell from # tip, this container now runs and exposes stable Chrome headless via # google-chome --headless. # # 最新消息 # # 1. Pulls from Chrome Stable # 2. You can now use the ever-awesome Jessie Frazelle seccomp profile for Chrome. # wget https://raw.githubusercontent.com/jfrazelle/dotfiles/master/etc/docker/seccomp/chrome.json -O ~/chrome.json # # # To run (without seccomp): # docker run -d -p 9222:9222 --cap-add=SYS_ADMIN justinribeiro/chrome-headless # # To run a better way (with seccomp): # docker run -d -p 9222:9222 --security-opt seccomp=$HOME/chrome.json justinribeiro/chrome-headless # # Basic use: open Chrome, navigate to http://localhost:9222/ # # Base docker image FROM debian:buster-slim LABEL name="chrome-headless" \ maintainer="Justin Ribeiro <justin@justinribeiro.com>" \ version="3.0" \ description="Google Chrome Beta Headless in a container" # Install deps + add Chrome Stable + purge all the things RUN apt-get update && apt-get install -y \ apt-transport-https \ ca-certificates \ curl \ gnupg \ --no-install-recommends \ && curl -sSL https://dl.google.com/linux/linux_signing_key.pub | apt-key add - \ && echo "deb https://dl.google.com/linux/chrome/deb/ stable main" > /etc/apt/sources.list.d/google-chrome.list \ && apt-get update && apt-get install -y \ google-chrome-beta \ fontconfig \ fonts-ipafont-gothic \ fonts-wqy-zenhei \ fonts-thai-tlwg \ fonts-kacst \ fonts-symbola \ fonts-noto \ fonts-freefont-ttf \ --no-install-recommends \ && apt-get purge --auto-remove -y curl gnupg \ && rm -rf /var/lib/apt/lists/* # Add Chrome as a user RUN groupadd -r chrome && useradd -r -g chrome -G audio,video chrome \ && mkdir -p /home/chrome && chown -R chrome:chrome /home/chrome \ && mkdir -p /opt/google/chrome-beta && chown -R chrome:chrome /opt/google/chrome-beta # Run Chrome non-privileged USER chrome # Expose port 9222 EXPOSE 9222 # Autorun chrome headless with no GPU ENTRYPOINT [ "google-chrome" ] CMD [ "--headless", "--disable-gpu", "--remote-debugging-address=0.0.0.0", "--remote-debugging-port=9222" ]
安装完访问 http://localhost:9222/json 得到如下
[ { "description": "", "devtoolsFrontendUrl": "/devtools/inspector.html?ws=192.168.0.56:9222/devtools/page/5110254FCED608FEA5F25F61751EE5ED", "id": "5110254FCED608FEA5F25F61751EE5ED", "title": "about:blank", "type": "page", "url": "about:blank", "webSocketDebuggerUrl": "ws://localhost:9222/devtools/page/5110254FCED608FEA5F25F61751EE5ED" } ]
如果是在 Mac 下本地使用的话我们还可以创建命令别名:
alias chrome="/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome" alias chrome-canary="/Applications/Google\ Chrome\ Canary.app/Contents/MacOS/Google\ Chrome\ Canary" alias chromium="/Applications/Chromium.app/Contents/MacOS/Chromium"
如果是在 Ubuntu 环境下我们可以使用 deb 进行安装:
# Install Google Chrome # https://askubuntu.com/questions/79280/how-to-install-chrome-browser-properly-via-command-line sudo apt-get install libxss1 libappindicator1 libindicator7 wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb sudo dpkg -i google-chrome*.deb # Might show "errors", fixed by next line sudo apt-get install -f
chrome 命令行也支持丰富的命令行参数,--dump-dom
参数可以将 document.body.innerHTML
打印到标准输出中:
chrome --headless --disable-gpu --dump-dom https://www.chromestatus.com/
而 --print-to-pdf
标识则会将网页输出位 PDF:
chrome --headless --disable-gpu --print-to-pdf https://www.chromestatus.com/
初次之外,我们也可以使用 --screenshot
参数来获取页面截图:
chrome --headless --disable-gpu --screenshot https://www.chromestatus.com/ # Size of a standard letterhead. chrome --headless --disable-gpu --screenshot --window-size=1280,1696 https://www.chromestatus.com/ # Nexus 5x chrome --headless --disable-gpu --screenshot --window-size=412,732 https://www.chromestatus.com/
如果我们需要更复杂的截图策略,譬如进行完整页面截图则需要利用代码进行远程控制。
代码控制
启动
在上文中我们介绍了如何利用命令行来手动启动 Chrome,这里我们尝试使用 Node.js 来启动 Chrome,最简单的方式就是使用 child_process 来启动:
const exec = require('child_process').exec; function launchHeadlessChrome(url, callback) { // Assuming MacOSx. const CHROME = '/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome'; exec(`${CHROME} --headless --disable-gpu --remote-debugging-port=9222 ${url}`, callback); } launchHeadlessChrome('https://www.chromestatus.com', (err, stdout, stderr) => { ... });
远程控制
这里我们使用 chrome-remote-interface 来远程控制 Chrome ,实际上 chrome-remote-interface 是对于 Chrome DevTools Protocol 的远程封装,我们可以参考协议文档了解详细的功能与参数。使用 npm 安装完毕之后,我们可以用如下代码片进行简单控制:
const CDP = require('chrome-remote-interface'); CDP((client) => { // extract domains const {Network, Page} = client; // setup handlers Network.requestWillBeSent((params) => { console.log(params.request.url); }); Page.loadEventFired(() => { client.close(); }); // enable events then start! Promise.all([ Network.enable(), Page.enable() ]).then(() => { return Page.navigate({url: 'https://github.com'}); }).catch((err) => { console.error(err); client.close(); }); }).on('error', (err) => { // cannot connect to the remote endpoint console.error(err); });
我们也可以使用 chrome-remote-interface 提供的命令行功能,譬如我们可以在命令行中访问某个界面并且记录所有的网络请求:
$ chrome-remote-interface inspect >>> Network.enable() { result: {} } >>> Network.requestWillBeSent(params => params.request.url) { 'Network.requestWillBeSent': 'params => params.request.url' } >>> Page.navigate({url: 'https://www.wikipedia.org'}) { 'Network.requestWillBeSent': 'https://www.wikipedia.org/' } { result: { frameId: '5530.1' } } { 'Network.requestWillBeSent': 'https://www.wikipedia.org/portal/wikipedia.org/assets/img/Wikipedia_wordmark.png' } { 'Network.requestWillBeSent': 'https://www.wikipedia.org/portal/wikipedia.org/assets/img/Wikipedia-logo-v2.png' } { 'Network.requestWillBeSent': 'https://www.wikipedia.org/portal/wikipedia.org/assets/js/index-3b68787aa6.js' } { 'Network.requestWillBeSent': 'https://www.wikipedia.org/portal/wikipedia.org/assets/js/gt-ie9-c84bf66d33.js' } { 'Network.requestWillBeSent': 'https://www.wikipedia.org/portal/wikipedia.org/assets/img/sprite-bookshelf_icons.png?16ed124e8ca7c5ce9d463e8f99b2064427366360' } { 'Network.requestWillBeSent': 'https://www.wikipedia.org/portal/wikipedia.org/assets/img/sprite-project-logos.png?9afc01c5efe0a8fb6512c776955e2ad3eb48fbca' }
我们也可以直接查看内置的接口文档:
>>> Page.navigate { [Function] category: 'command', parameters: { url: { type: 'string', description: 'URL to navigate the page to.' } }, returns: [ { name: 'frameId', '$ref': 'FrameId', hidden: true, description: 'Frame id that will be navigated.' } ], description: 'Navigates current page to the given URL.', handlers: [ 'browser', 'renderer' ] }>>> Page.navigate { [Function] category: 'command', parameters: { url: { type: 'string', description: 'URL to navigate the page to.' } }, returns: [ { name: 'frameId', '$ref': 'FrameId', hidden: true, description: 'Frame id that will be navigated.' } ], description: 'Navigates current page to the given URL.', handlers: [ 'browser', 'renderer' ] }
我们在上文中还提到需要以代码控制浏览器进行完整页面截图,这里需要利用 Emulation 模块控制页面视口缩放:
const CDP = require('chrome-remote-interface'); const argv = require('minimist')(process.argv.slice(2)); const file = require('fs'); // CLI Args const url = argv.url || 'https://www.google.com'; const format = argv.format === 'jpeg' ? 'jpeg' : 'png'; const viewportWidth = argv.viewportWidth || 1440; const viewportHeight = argv.viewportHeight || 900; const delay = argv.delay || 0; const userAgent = argv.userAgent; const fullPage = argv.full; // Start the Chrome Debugging Protocol CDP(async function(client) { // Extract used DevTools domains. const {DOM, Emulation, Network, Page, Runtime} = client; // Enable events on domains we are interested in. await Page.enable(); await DOM.enable(); await Network.enable(); // If user agent override was specified, pass to Network domain if (userAgent) { await Network.setUserAgentOverride({userAgent}); } // Set up viewport resolution, etc. const deviceMetrics = { width: viewportWidth, height: viewportHeight, deviceScaleFactor: 0, mobile: false, fitWindow: false, }; await Emulation.setDeviceMetricsOverride(deviceMetrics); await Emulation.setVisibleSize({width: viewportWidth, height: viewportHeight}); // Navigate to target page await Page.navigate({url}); // Wait for page load event to take screenshot Page.loadEventFired(async () => { // If the `full` CLI option was passed, we need to measure the height of // the rendered page and use Emulation.setVisibleSize if (fullPage) { const {root: {nodeId: documentNodeId}} = await DOM.getDocument(); const {nodeId: bodyNodeId} = await DOM.querySelector({ selector: 'body', nodeId: documentNodeId, }); const {model: {height}} = await DOM.getBoxModel({nodeId: bodyNodeId}); await Emulation.setVisibleSize({width: viewportWidth, height: height}); // This forceViewport call ensures that content outside the viewport is // rendered, otherwise it shows up as grey. Possibly a bug? await Emulation.forceViewport({x: 0, y: 0, scale: 1}); } setTimeout(async function() { const screenshot = await Page.captureScreenshot({format}); const buffer = new Buffer(screenshot.data, 'base64'); file.writeFile('output.png', buffer, 'base64', function(err) { if (err) { console.error(err); } else { console.log('Screenshot saved'); } client.close(); }); }, delay); }); }).on('error', err => { console.error('Cannot connect to browser:', err); });
php 部分
我们想跟后台运行的浏览器通讯,必须借助websocket
的帮助,实际上是调用Chrome DevTools Protocol的api完成。
php 的websocket我选用了
"textalk/websocket": "1.0.*"
获取
webSocketDebuggerUrl
,这个非常重要,我们要拿着这个url通过websocket跟浏览器通讯
$curlCmd = "curl -s 127.0.0.1:9222/json"; do { $json = json_decode(shell_exec($curlCmd)); } while (empty($json)); $endpoint = $json[0]->webSocketDebuggerUrl;获取到的response是这样滴
[ { "description": "", "devtoolsFrontendUrl": "/devtools/inspector.html?ws=localhost:9222/devtools/page/76B74DBA491F065BDD6097DD81F5514C", "id": "76B74DBA491F065BDD6097DD81F5514C", "title": "about:blank", "type": "page", "url": "about:blank", "webSocketDebuggerUrl": "ws://localhost:9222/devtools/page/76B74DBA491F065BDD6097DD81F5514C" } ]
跟浏览器通讯,打印页面,代码中传输的参数都来自这里Chrome DevTools Protocol
``` // websocket $client = new Client($endpoint); $client->send(json_encode([ 'id' => 1, "method" => 'Page.enable', ])); $client->send(json_encode([ 'id' => 2, "method" => 'Page.navigate', "params" => ['url' => "https://test.com"] ])); $frameId = null; while ($data = json_decode($client->receive())) { // 判断网页是否打开 if (@$data->id == 2) { $frameId = $data->result->frameId; } // 判断网页是否停止加载 if (@$data->method == 'Page.frameStoppedLoading' && @$data->params->frameId == $frameId) { $client->send(json_encode([ 'id' => 3, "method" => 'Page.printToPDF', ])); } // 获取结果 if (@$data->id == 3) { file_put_contents("test.pdf", base64_decode($data->result->data)); break; } } ```
饭(坑)后(坑)甜(坑)点
打印的网页中文乱码,原因是chrome运行的环境没有中文支持,需要安装中文字体
apt-get update && apt-get install -y fonts-wqy-zenhei
所打印的页面中css
background-color
打印不出来,需要在css中添加-webkit-print-color-adjust: exact;
.
出处
结束语
不得不说chrome的打印非常牛逼,完美重现你的画面,js渲染的页面也可打印,哈哈哈,因为人家是浏览器啊~~~,天生支持。写的过程中还是有一个问题没解决的,react渲染页面需要时间,后台访问你的页面时有的还没加载完,打印会缺失一部分,我也不得不在后台写一个sleep()
解决,大家要是有什么好的办法可以留言讨论~~~
已有1位网友发表了看法: