
Ooi Beng Chin 黄铭钧

Databases, Machine Learning and Systems

 
 
 


 
 

Tweet Data Management

2011-03-30 09:21:06


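In social networks and microblogging systems, the data as a whole resides in a single system, so it should be searchable as soon as it is produced. To offer real-time search over blogs and microblogs, we need to update the database and the index immediately. In microblogging systems, however, posts are produced extremely quickly: in some of the more popular systems, users may publish more than 50 million posts per day. Providing real-time retrieval in a microblogging system with a large user base is indeed a challenging problem. In such a system, thousands of new posts arrive every second; we need to index them in real time and provide an effective and fast way to query them. Existing search engines typically crawl web pages and then refresh the index at intervals. This mechanism does not suit microblogging systems, which produce new data frequently. In a microblogging system, the freshness of the index and the relevance of the search results depend on how quickly the system can acquire the content. In addition, microblog users are linked by friend and follower relationships, which makes search more complicated. Microblog posts are usually very short, so taking the relationships between users into account helps in understanding what a post means. In other words, we need to consider more factors than the frequency with which a keyword appears in a post. In [1], we designed a real-time indexing method that, during retrieval, considers the friend relationships between users and the reply relationships between posts.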

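Overall, microblog management systems pose new challenges. In the next few years, we will see rapid growth in the microblog user base and, with it, ever more frequent microblog updates.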

In social networking and microblogging systems, all data is stored within one system, so content should be searchable as soon as it is produced. To make a blog post or tweet searchable the moment it is posted, both the database and the index must be updated in real time. The main problem in microblogging systems, however, is the unprecedented number of tweets or microblog posts published each day; some popular tweet service providers, for example, handle more than 50 million tweets per day. Providing real-time search is therefore very challenging in large-scale microblogging systems. In such a system, thousands of new updates must be processed per second. To make every update searchable, we need to index it in real time while still providing effective and efficient keyword-based retrieval.

The indexing strategy adopted by conventional search engines, which crawl web pages and update the index periodically, is not effective for systems such as microblogging services that generate huge amounts of content very frequently. On the web, content is not managed in one location or system and no trigger fires when it changes, so the freshness of the index, and with it the relevance of the search results, depends on how frequently the content is crawled.

Further, tweet search is complicated by the users and their following/follower relationships. A tweet is short, at most 140 characters long, and may carry more meaning than its text alone conveys once these relationships are considered. That is, we should look beyond the mere presence of terms or keywords when analysing trends and patterns. In [1], we propose to index tweets in real time and to consider some of the user and tweet relationships during retrieval.
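To make the contrast with periodic crawling concrete, here is a minimal Python sketch of an inverted index that is updated on every insert and ranks results using a follower count as well as keyword matches. It is not the TI mechanism from [1]. The class name `RealTimeTweetIndex`, the `Tweet` fields, and the scoring formula (follower count divided by age) are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Tweet:
    tweet_id: int
    user: str
    text: str
    timestamp: float  # seconds since an arbitrary epoch


class RealTimeTweetIndex:
    """Toy in-memory inverted index, updated on every insert (illustrative only)."""

    def __init__(self):
        self.postings = defaultdict(set)   # term -> ids of tweets containing it
        self.tweets = {}                   # tweet id -> Tweet
        self.followers = defaultdict(set)  # user -> set of that user's followers

    def follow(self, follower, followee):
        self.followers[followee].add(follower)

    def add(self, tweet):
        # Index at insert time, so the tweet is searchable immediately
        # instead of waiting for a periodic crawl-and-rebuild cycle.
        self.tweets[tweet.tweet_id] = tweet
        for term in set(tweet.text.lower().split()):
            self.postings[term].add(tweet.tweet_id)

    def search(self, query, now, top_k=10):
        terms = query.lower().split()
        if not terms:
            return []
        # Conjunctive keyword match: a tweet must contain every query term.
        candidates = set.intersection(
            *(self.postings.get(term, set()) for term in terms)
        )

        def score(tweet_id):
            tweet = self.tweets[tweet_id]
            # Hypothetical ranking: authors with more followers rank higher,
            # and older tweets decay. This formula is an assumption.
            popularity = 1 + len(self.followers.get(tweet.user, ()))
            age = max(0.0, now - tweet.timestamp)
            return popularity / (1.0 + age)

        # Highest score first; ties are broken by tweet id for determinism.
        ranked = sorted(candidates, key=lambda tid: (-score(tid), tid))
        return ranked[:top_k]


index = RealTimeTweetIndex()
index.follow("bob", "alice")
index.follow("carol", "alice")
index.add(Tweet(1, "alice", "SIGMOD deadline soon", 100.0))
index.add(Tweet(2, "dave", "sigmod deadline extended", 100.0))
index.add(Tweet(3, "dave", "lunch time", 105.0))
print(index.search("sigmod deadline", now=110.0))  # prints [1, 2]
```

Both matching tweets have the same age, so the one whose author has more followers ranks first. A production system would also need persistence, concurrency control, and a way to bound index size, none of which this sketch attempts.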

Tweet management presents new challenges, and as its popularity grows, the amount of tweet data to handle will keep increasing in the years to come.

References:

[1] C. Chen, F. Li, B. C. Ooi, S. Wu. TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets. ACM International Conference on Management of Data (SIGMOD), 2011.
