<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>邪罗刹的菠萝阁 &#187; flickr</title>
	<atom:link href="http://www.rainmoe.com/tag/flickr/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.rainmoe.com</link>
	<description>One code, one world ...</description>
	<lastBuildDate>Thu, 29 Dec 2011 14:04:21 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>How to Scrape Image URLs from a Flickr Photo Set</title>
		<link>http://www.rainmoe.com/2010/01/10/get-the-files-and-urls-from-flickr/</link>
		<comments>http://www.rainmoe.com/2010/01/10/get-the-files-and-urls-from-flickr/#comments</comments>
		<pubDate>Sat, 09 Jan 2010 18:13:45 +0000</pubDate>
		<dc:creator>小邪</dc:creator>
				<category><![CDATA[作品 [Work]]]></category>
		<category><![CDATA[curl]]></category>
		<category><![CDATA[flickr]]></category>
		<category><![CDATA[php]]></category>

		<guid isPermaLink="false">http://www.evlos.org/?p=1893</guid>
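		<description><![CDATA[> Recently I decided to switch all the image URLs in my blog posts over to the copies hosted on Flickr, so I need a scraper to save time.

> Um .. purely to save time. Scraping is bad news for a lot of people. My own site has been scraped too, and it was a tragedy.



<span class="readmore"><a href="http://www.rainmoe.com/2010/01/10/get-the-files-and-urls-from-flickr/" title="How to Scrape Image URLs from a Flickr Photo Set">Read the full post (3124 characters)</a></span>]]></description>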
			<content:encoded><![CDATA[<p>> Recently I decided to switch all the image URLs in my blog posts over to the copies hosted on Flickr, so I wrote a PHP scraper to save time.<br />
> Um .. purely to save time. Scraping is bad news for a lot of people, because it can copy a whole website and turn it into a mirror site that makes money.<br />
> My own site was scraped like that last month, and it was a tragedy.</p>
<p><img src='http://www.rainmoe.com/wp-content/uploads/old/Capture1105.jpg' /></p>
<p><span id="more-1893"></span>> So here I will explain how to scrape Flickr images with PHP's cURL functions and regular expressions.<br />
> First, fetch the page HTML with cURL, sending cookies along with the request.<br />
> Then use a regex to pull out the image codes, and finally use each code to get the URL of the large-size image.</p>
<p><strong>I. Analyzing the URLs:</strong></p>
<p><strong>1. User URL -</strong></p>
<p>> http://www.flickr.com/photos/46051661@N04<br />
> For example, this is what my user URL looks like. It is very regular, and regular things are the easiest to deal with.</p>
<p><strong>2. Photo set (album) URL -</strong></p>
<p>> http://www.flickr.com/photos/46051661@N04/sets/72157623167782492<br />
> Um .. the user code sits in the middle, then "sets" marks a photo set, and the code of the set itself comes last.<br />
> I like to call these unique strings "codes". Haha, it makes them easier to talk about.</p>
<p><strong>3. Single image URL -</strong></p>
<p>> http://www.flickr.com/photos/46051661@N04/4259923860/<br />
> http://www.flickr.com/photos/46051661@N04/4259923860/in/set-72157623167782492<br />
> <img src='/apps/smiles/icon_smile.gif' alt=':)' class='wp-smiley' /> Flickr offers two kinds of URL, but both lead to exactly the same page.<br />
> So, naturally, we pick the shorter one on top.</p>
<p><strong>4. Large-size URL for a single image -</strong></p>
<p>> http://www.flickr.com/photos/46051661@N04/4259923860/sizes/o/<br />
> My images are generally 600px wide and no more than 600px tall, so I can view them at full size through the "o" URL.<br />
> http://www.flickr.com/photos/46051661@N04/4259923860/sizes/l/<br />
> Flickr limits free users to 1024px, so it does not serve the full-size version of bigger images. "l" is the URL of the version scaled down to 1024px, so for those images "l" has to do.<br />
> Haha, there are four more sizes, each smaller than the last: M, S, T and SQ. OK, let's get started.</p>
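<p>> The URL patterns above can be put together mechanically. Here is a minimal sketch. The function names flickr_photo_url and flickr_size_url are made up for this illustration, and it only covers the patterns listed above.</p>
<pre class="brush: php; auto-links: false; html-script: false; title: ; notranslate">
// Hypothetical helper (not part of the original script): page URL of a single photo.
function flickr_photo_url($user, $code) {
	return 'http://www.flickr.com/photos/' . $user . '/' . $code . '/';
}
// Hypothetical helper: page URL of one size of that photo ('o' or 'l').
function flickr_size_url($user, $code, $size = 'o') {
	return flickr_photo_url($user, $code) . 'sizes/' . $size . '/';
}
echo flickr_size_url('46051661@N04', '4259923860', 'l');
// http://www.flickr.com/photos/46051661@N04/4259923860/sizes/l/
</pre>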
<p><strong>II. Start scraping:</strong></p>
<p><strong>1. Scraping the image codes:</strong></p>
<p>> $sa[0] stores the image codes and $sa[1] stores the image titles.<br />
> $sa[2][0] stores the number of images. It sits one level deeper because $sa is a 2D array.<br />
> A nested foreach could walk the whole 2D array, but I don't want that here. Treating it as two 1D arrays is enough.</p>
<pre class="brush: php; auto-links: false; html-script: false; title: ; notranslate">
function app_get_set_info($data) {
	// Group 1 is the photo code, group 2 is the photo title.
	// \s+ allows a space or a line break between the two attributes.
	$regex = '%/photos/46051661@N04/(\d+)/in/set-\d+/&quot;\s+title=&quot;([\w-]*)&quot; class%i';
	preg_match_all($regex, $data, $save);
	$sa = array();
	$sa[0] = $save[1];           // photo codes
	$sa[1] = $save[2];           // photo titles
	$sa[2][0] = count($save[1]); // number of photos
	return $sa;
}
</pre>
<p>> $save[1] is the array holding whatever the first pair of brackets matched, and $save[2] holds the matches of the second pair.<br />
> There is one more, $save[0]. That one, of course, holds the full strings matched by the whole regex, O(∩_∩)O.</p>
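<p>> To see how preg_match_all fills these three arrays, here is a tiny sketch. The sample text and the pattern are made up for this illustration and have nothing to do with Flickr's real HTML.</p>
<pre class="brush: php; auto-links: false; html-script: false; title: ; notranslate">
// Made-up sample text, only to show how preg_match_all fills its result array.
$text = 'id=12 name=foo; id=34 name=bar';
preg_match_all('%id=(\d+) name=(\w+)%', $text, $save);
print_r($save[0]); // whole matches: 'id=12 name=foo' and 'id=34 name=bar'
print_r($save[1]); // first bracket: '12' and '34'
print_r($save[2]); // second bracket: 'foo' and 'bar'
echo count($save[1]); // 2
</pre>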
<p><strong>2. Scraping the image URL:</strong></p>
<p>> Um .. the whole size page contains just one large JPG or PNG image (the page elements are GIFs).<br />
> So we can easily grab it with a regex like the one below. ╮(╯▽╰)╭ Poor Flickr, I've stripped it half naked.<br />
> Haha, almost like 619, that stark-naked guy (619 insisted that I give his Streaking Empire a <a target='_blank' rel='nofollow' href='http://liuyijun.com/'>portal link</a>).</p>
<pre class="brush: php; auto-links: false; html-script: false; title: ; notranslate">
function app_get_ourl($data) {
	// Look for the single large JPG first; [^&quot;]+ stops at the closing quote.
	$pagelist_regex = '%&lt;img src=&quot;([^&quot;]+\.jpg)&quot;%i';
	preg_match_all($pagelist_regex, $data, $save);
	//print_r($save);
	if (!isset($save[1][0])) {
		// No JPG on the page, so fall back to PNG.
		$pagelist_regex = '%&lt;img src=&quot;([^&quot;]+\.png)&quot;%i';
		preg_match_all($pagelist_regex, $data, $save);
	}
	return $save[1];
}
</pre>
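<p>> A note on matching the src attribute: a greedy .+ can run across several tags when they share a line, while a character class that excludes the quote stops at the closing quote. A small sketch with made-up HTML (not Flickr's real markup):</p>
<pre class="brush: php; auto-links: false; html-script: false; title: ; notranslate">
// Made-up one-line HTML with a GIF page element in front of the real photo.
$html = '&lt;img src=&quot;a.gif&quot;&gt;&lt;img src=&quot;photo.jpg&quot;&gt;';
preg_match('%&lt;img src=&quot;(.+\.jpg)&quot;%i', $html, $greedy);
preg_match('%&lt;img src=&quot;([^&quot;]+\.jpg)&quot;%i', $html, $strict);
// $greedy[1] is: a.gif&quot;&gt;&lt;img src=&quot;photo.jpg
// $strict[1] is: photo.jpg
echo $strict[1];
</pre>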
<p><strong>3. cURL with cookies:</strong></p>
<p>> Once logged in, we can see all the photos on the set page, so we pass the cookies to Flickr through cURL and get the HTML back.<br />
> Haha, same as ever, everybody likes to imitate Firefox's request headers. After that come a few required options.</p>
<pre class="brush: php; auto-links: false; html-script: false; title: ; notranslate">
function app_get_html($url, $cookie = '') {
	$curl = curl_init($url);
	// Pretend to be Firefox. The string must stay on one line.
	$useragent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1';
	curl_setopt($curl, CURLOPT_RETURNTRANSFER, true); // return the page instead of printing it
	curl_setopt($curl, CURLOPT_USERAGENT, $useragent);
	if ($cookie !== '') {
		curl_setopt($curl, CURLOPT_COOKIE, $cookie);
	}
	$data = curl_exec($curl);
	curl_close($curl);
	return $data;
}
// Usage example. Please find the cookie values yourself, I'm too lazy ╮(╯▽╰)╭.
$data = app_get_html('http://www.flickr.com/photos/46051661@N04/sets/72157623167782492',
	'cookie_accid=16212532;cookie_epass=816e23c7b24aa6q9f13713e7503de07f;');
</pre>
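<p>> CURLOPT_COOKIE takes all the cookies as one string of name=value pairs separated by semicolons. If you keep your cookies in an array, a small helper can build that string. app_cookie_string is a name made up for this sketch, and the sample values are placeholders.</p>
<pre class="brush: php; auto-links: false; html-script: false; title: ; notranslate">
// Hypothetical helper: join an array of cookies into one string for CURLOPT_COOKIE.
function app_cookie_string($cookies) {
	$pairs = array();
	foreach ($cookies as $name =&gt; $value) {
		$pairs[] = $name . '=' . $value;
	}
	return implode('; ', $pairs);
}
echo app_cookie_string(array('a' =&gt; '1', 'b' =&gt; '2')); // a=1; b=2
</pre>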
<p><strong>4. How the program runs:</strong></p>
<p>> First of all, get your Flickr cookies yourself, then scrape all the image codes contained in the set page.<br />
> Save them in a database or something similar (because we will probably time out a lot), then read the stored codes back one by one.</p>
<p>> Next, get the URL from each image page and save it to the database. If a URL is already stored, skip it.<br />
> A timeout is 100% certain to happen, which is why it works this way. When it does, just refresh the page and carry on, hehe.</p>
<p>> Please forgive me for not posting the complete source code. I'm afraid of attracting Flickr's attention, unlikely as that is.<br />
> Better safe than sorry. Besides, all the main code is already posted, and I trust you can handle the database part.<br />
> Haha, it's past two in the morning again and I'm really sleepy Zzzzzzzzzz Good night, I'm off to bed.</p>
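<p>> The skip-what-is-already-stored loop could be sketched like this. Assumptions: a JSON file stands in for the database, and app_resume_fetch and $fetch are names made up for this sketch, not the original code.</p>
<pre class="brush: php; auto-links: false; html-script: false; title: ; notranslate">
// Sketch only: $file is a JSON file used instead of a database, and $fetch is any
// callable that returns the image URL for a photo code, or false on failure.
function app_resume_fetch($codes, $file, $fetch) {
	$done = file_exists($file) ? json_decode(file_get_contents($file), true) : array();
	if (!is_array($done)) {
		$done = array();
	}
	foreach ($codes as $code) {
		if (isset($done[$code])) {
			continue; // already stored on an earlier run, skip it
		}
		$url = call_user_func($fetch, $code);
		if ($url === false) {
			continue; // failed this time, a later run will retry it
		}
		$done[$code] = $url;
		file_put_contents($file, json_encode($done)); // save after every photo
	}
	return $done;
}
</pre>
<p>> Because the file is rewritten after every photo, a run that dies on a timeout loses nothing. Refreshing the page simply picks up at the first code that is not stored yet.</p>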
<p><strong>III. Is this a paid comment army passing by?!:</strong></p>
<p><img src='http://www.rainmoe.com/wp-content/uploads/old/Capture1104.jpg' /></p>
<p>> A screenshot as a keepsake ╮(╯▽╰)╭ Other people take photos as keepsakes, but there is nothing I like better than taking a screenshot.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.rainmoe.com/2010/01/10/get-the-files-and-urls-from-flickr/feed/</wfw:commentRss>
		<slash:comments>117</slash:comments>
		</item>
	</channel>
</rss>

