Brown Dog digs into the deep, dark web

Unstructured data is the bane of researchers everywhere. Although casual Googlers may be frustrated by not being able to open online files, researchers often need to dig into data trapped in outdated formats and uncurated collections with little or no metadata. And according to IDC, up to 90 percent of big data is "dark," meaning the contents of such files cannot be easily accessed.

Enter Brown Dog, a proposed solution to this long-tail problem.

Led by Kenton McHenry and Jong Lee of the Image and Spatial Data Analysis division at the National Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign, Brown Dog seeks to develop a service that will make uncurated data accessible.

"The information age has made it easy for anyone to create and share vast amounts of digital data, including unstructured collections of images, video and audio as well as documents and spreadsheets," said McHenry. "But the ability to search and use the contents of digital data has become exponentially more difficult."

Brown Dog is working to change that. Recipients in 2013 of a $10 million, five-year award from the National Science Foundation, the UI team recently demonstrated two services to make the contents of uncurated data collections accessible.

The first, called Data Access Proxy (DAP), transforms unreadable files into readable ones by linking together a series of computing and translational operations behind the scenes.
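One way to picture "linking together a series of operations" is as a path search over a table of available converters. The sketch below is illustrative only: the CONVERTERS table, its format pairs and the function name are assumptions, not Brown Dog's actual registry or API. It uses a breadth-first search to find the shortest chain of conversions from an unreadable format to one the client can open.

```python
from collections import deque

# Hypothetical converter table: "a tool exists that turns X into Y".
# These pairs are made up for illustration; they are not Brown Dog's registry.
CONVERTERS = {
    "wpd": ["doc"],
    "doc": ["odt", "pdf"],
    "odt": ["pdf", "txt"],
    "pcx": ["bmp"],
    "bmp": ["png"],
}

def find_conversion_chain(source, readable):
    """Return the shortest list of formats from source to any readable format, or None."""
    if source in readable:
        return [source]
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        for nxt in CONVERTERS.get(path[-1], []):
            if nxt in seen:
                continue
            if nxt in readable:
                return path + [nxt]
            seen.add(nxt)
            queue.append(path + [nxt])
    return None  # no chain of known converters reaches a readable format
```

Under these assumed entries, an old "wpd" file requested by a client that reads only "pdf" would be routed through "doc" on the way.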

How Brown Dog deep web search tools work

Similar to an Internet gateway, the configuration of the DAP would be entered into a user's machine settings. Thereafter, data requests over HTTP would first be examined by the proxy to determine if the native file format is readable on the client device.

If not, the DAP would be called in the background to convert the file into the best possible format readable by the client machine.
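The decision described above can be sketched as a small function. This is a simplified illustration, not the project's implementation: the function name, the ordered list of client formats and the "reachable" lookup table are all assumptions made for the example.

```python
def plan_response(native_format, client_formats, reachable):
    """Decide how a proxy might answer a request for a file.

    client_formats: formats the client can open, best first (assumed input).
    reachable: dict mapping a native format to the set of formats a
               conversion service could produce from it (assumed input).
    """
    # The file is already readable on the client: pass it through untouched.
    if native_format in client_formats:
        return ("pass-through", native_format)
    # Otherwise pick the client's most preferred format that conversion can reach.
    options = reachable.get(native_format, set())
    for fmt in client_formats:
        if fmt in options:
            return ("convert", fmt)
    # Nothing the client can read is reachable from this format.
    return ("unavailable", None)
```

Ordering the client's formats by preference is one simple way to express "the best possible format"; a real service would likely also weigh how much information each conversion loses.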

The second tool, the Data Tilling Service (DTS), lets individuals search collections of data, using an existing file to discover similar files in the data. For example, while browsing an online image collection, a user could drop an image of three people into the search field, and the DTS would return images in the collection that also contain three people.

If the DTS encounters a file format it is unable to parse, it would use the Data Access Proxy to make the file accessible. It also indexes the data and extracts and appends metadata to files to give users a sense of the type of data they are encountering.
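A toy version of that flow, covering extracting metadata, falling back to conversion when a format cannot be parsed, and then searching by example, might look like the following. Everything here is hypothetical: the extractor only "understands" one format, the conversion function is a stand-in that always succeeds, and the "people" count is supplied directly rather than detected from an image.

```python
def extract_metadata(fmt, payload):
    """Hypothetical extractor that can only parse 'png' payloads."""
    if fmt != "png":
        raise ValueError(f"cannot parse {fmt}")
    return {"people": payload["people"]}

def convert_to_png(fmt, payload):
    """Stand-in for the Data Access Proxy; assumes conversion always succeeds."""
    return "png", payload

def build_index(files):
    """Map each file name to its extracted metadata, converting when needed."""
    index = {}
    for name, fmt, payload in files:
        try:
            meta = extract_metadata(fmt, payload)
        except ValueError:
            # Unparseable format: convert first, then extract.
            meta = extract_metadata(*convert_to_png(fmt, payload))
        index[name] = meta
    return index

def find_similar(index, query_meta):
    """Return the names of files whose metadata matches the query file's."""
    return sorted(name for name, meta in index.items() if meta == query_meta)
```

With a collection containing two three-person images, one of them in a legacy format, a query built from another three-person image would return both.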

Rather than starting from scratch and constructing a single all-encompassing piece of software, the NCSA team is building on previous software development work. The project aims to bring together every possible source of automated help already in existence. By patching together such components, they plan to make Brown Dog the "super mutt" of software, according to the NSF release.

"Brown Dog today is developing a 'time machine' set of cyberinfrastructure tools, software and services that respond to the long-standing aspiration of many scientific, research and educational communities to effectively access, share and apply digital data and information originating in diverse sources and legacy environments in order to advance contemporary science, research and education," said Robert Chadduck, the program director at NSF who oversees the award.

Brown Dog isn't only useful for searching the deep web, either. McHenry says the Brown Dog software suite could one day be used to help individuals manage ever-growing collections of photos, videos and unstructured/uncurated data on the Web.

"Being at the University of Illinois and NCSA, many of us strive to create something that will live on to have the broad impact that the NCSA Mosaic Web browser did," McHenry said, referring to one of the first widely used graphical web browsers, which was developed at NCSA.

"It is our hope that Brown Dog will serve as the beginnings of yet another such indispensable component for the Internet of tomorrow."

About the Author

Connect with the GCN staff on Twitter @GCNtech.

