Category Archives: 12-datascraping

maddy

27 Jan 2015

originally, my general goal was to steal all the blingees.
i got pretty far, & reverse engineered the public-facing blingee app
i have a lot of keys now
but then realized i know nothing of xml calls
so we’re working on that
#whoops

but here’s some preview images i scraped as a prelim test:
41638283_1763aa12
 140191486_628270
439193309_248339
607417843_741179
834284893_1089737
842845040_1805529

 

~~**~~**¢ðÐê ¢ðmïñg §ððñ*~*~**

 

dsrusso

27 Jan 2015

This data scraper pulls an extensive XML profile for every business in a given ZIP code that is registered as producing toxic materials. The data obtained ranges from latitude and longitude to contact emails on file with the government. (See sample below.)

from temboo.Library.EnviroFacts.Toxins import FacilitiesSearchByZip
from temboo.core.session import TembooSession
import time

# Create a session with your Temboo account details
session = TembooSession("***********", "myFirstApp", "******************")

# Instantiate the Choreo
facilitiesSearchByZipChoreo = FacilitiesSearchByZip(session)

# Get an InputSet object for the Choreo
facilitiesSearchByZipInputs = facilitiesSearchByZipChoreo.new_input_set()

counter = 0

# Collect every response in one file; "with" closes it even if a request fails
with open("test.xml", "w") as out:
    # ZIP codes 15201 through 15294
    for zip_code in range(15201, 15295):
        counter += 1

        # Set the Choreo inputs
        facilitiesSearchByZipInputs.set_Zip(zip_code)

        # Execute the Choreo
        facilitiesSearchByZipResults = facilitiesSearchByZipChoreo.execute_with_results(facilitiesSearchByZipInputs)

        # Write the raw XML response, one per ZIP code
        out.write(facilitiesSearchByZipResults.get_Response())
        out.write("\n")
        print("%d zip: %d" % (counter, zip_code))

        # Pause between requests
        time.sleep(30)

 

The end goal of using this data is to create city maps that visualize concentrations of possible toxic contamination. The next step I’d like to add to this process is cross-referencing these concentrations with average income levels from the Census Bureau.
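
Because the script appends one response per ZIP code to test.xml, the file ends up holding many XML documents back to back rather than one well-formed document, so it would need splitting before any mapping step can parse it. A minimal sketch of that step, assuming every response begins with an XML declaration like the one in the sample, and assuming each top-level child of a response's root element is one facility record. No tag names are assumed, and `split_responses` and `records_per_response` are illustrative names, not part of Temboo:

```python
import xml.etree.ElementTree as ET

XML_DECLARATION = "<?xml"  # assumed to open every response written by the scraper


def split_responses(blob):
    """Split concatenated XML responses into one string per response."""
    pieces = blob.split(XML_DECLARATION)
    return [XML_DECLARATION + piece.rstrip() for piece in pieces if piece.strip()]


def records_per_response(blob):
    """Count the top-level children of each response's root element.

    Assumption: each top-level child is one facility record.
    """
    counts = []
    for document in split_responses(blob):
        root = ET.fromstring(document.encode("utf-8"))
        counts.append(len(root))
    return counts
```

Reading test.xml into a string and passing it to `records_per_response` would then give one count per response, in the order the ZIP codes were requested, which is roughly the per-ZIP concentration a map would need.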


<?xml version="1.0"?>

 
  15201BRBRSNEMCC
  BARBER SPRING
  ONE MCCANDLESS AVE
  PITTSBURGH
  ALLEGHENY
  42003
  PA
  15201
  3
  0
  BARBER SPRING
  ONE MCCANDLESS AVE
  PITTSBURGH
  PA
  
  
  15201-
  C
  
  
  NA
  WABTEC CORP
  402901
  795717
  
  
  
  
  
  
  
  
  0
  DAVID JABLONOWSKI
  4127827316
  
  DJABLONOWSKI@WABTEC.COM
  4127827316
  DJABLONOWSKI@WABTEC.COM
  
  WABTEC CORP
  
   
    1306204294351
    1
    15201BRBRSNEMCC
    007440473
    L
    2006
    DAVID JABLONOWSKI
    VICE PRESIDENT OF OPERATIONS
    0
    1
    03
    O
    01-JUN-07
    12-JUN-07
    DAVID JABLONOWSKI
    4127827316
    1
    
    1
    0
    1
    NA
    14-JUN-07
    0
    12-JUN-07
    14-JUN-07
    0
    0
    NA
    CHROMIUM
    0
    0
    0
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    M
   
   
    1309208100141
    1

Yeliz Karadayi

27 Jan 2015

I posted my looking out about expressive data, but to be quite honest I find myself to be a much more pragmatic person. I couldn’t think of anything that intrigued me enough to scrape a lot of data and risk having it visualize in a way that nobody gains anything from, and I couldn’t think of anything I cared about enough or had any kind of emotional reaction to. On the other hand, I did find myself scrolling through some torrent websites and noticing that some sites are more popular for certain file types or cultures, so I thought I’d look into that. It’s nice, pragmatic, and helpful for knowing where to go for which files. I’d actually like to expand beyond torrents and get into other kinds of file sharing if I can figure out how to get access to them.

Scraping torrent sites will expose not only which sites are best for which files, but also which files are popular at what times. This is essentially a map of what’s big in digital technology, if you think about it. Torrents make movies, software, books, music, and video games available, so this can be a map of what’s popular in general. The post dates make it possible to follow the rise and fall in popularity of the various files over time. I think this could also show cultural trends. Either way, there is a lot of potential, in my opinion, because of the large variety available from several very nicely organized websites.

{
  "name": "The Pirate Bay Top Torrents",
  "count": 40,
  "frequency": "Weekly",
  "version": 1,
  "newdata": true,
  "lastrunstatus": "success",
  "lastsuccess": "Tue Jan 27 2015 13:33:22 GMT+0000 (UTC)",
  "thisversionstatus": "success",
  "nextrun": "Tue Feb 03 2015 13:33:22 GMT+0000 (UTC)",
  "thisversionrun": "Tue Jan 27 2015 13:33:22 GMT+0000 (UTC)",
  "results": {
    "collection1": [
      {
        "Title": {
          "href": "https://oldpiratebay.org/torrent/499535/Consolevania-S02E06-Video-Gaming-Show",
          "text": "Consolevania S02E06 - Video Gaming Show"
        },
        "Type": {
          "href": "https://oldpiratebay.org/search.php?iht=5&age=0",
          "text": "Movies"
        },
        "Age": "8 years",
        "Size": "330.84 MB",
        "Seeders": "6",
        "Leechers": "16777215"
      },
      {
        "Title": {
          "href": "https://oldpiratebay.org/torrent/3046612/Otakutrad-it-Full-Metal-Alchemist-Hagaren-05-HD-720p-XviD-mp3-avi",
          "text": "[Otakutrad it] Full Metal Alchemist Hagaren 05(HD 720p-XviD mp3) avi"
        },
        "Type": {
          "href": "https://oldpiratebay.org/search.php?iht=1&age=0",
          "text": "Anime"
        },
        "Age": "5 years",
        "Size": "405.64 MB",
        "Seeders": "1",
        "Leechers": "16777215"
      },
      {
        "Title": {
          "href": "https://oldpiratebay.org/torrent/3116711/Ministry-of-Sound-Hard-NRG-Mixes-Volume-VII",
          "text": "Ministry of Sound Hard NRG Mixes - Volume VII"
        },
        "Type": {
          "href": "https://oldpiratebay.org/search.php?iht=6&age=0",
          "text": "Music"
        },
        "Age": "5 years",
        "Size": "290.52 MB",
        "Seeders": "4",
        "Leechers": "16777215"
      },
      {
        "Title": {
          "href": "https://oldpiratebay.org/torrent/3136850/Yuurisan-Subs-Higepiyo-10-h264-0B154A55-mkv",
          "text": "(Yuurisan-Subs) Higepiyo - 10 (h264)(0B154A55) mkv"
        },
        "Type": {
          "href": "https://oldpiratebay.org/search.php?iht=1&age=0",
          "text": "Anime"
        },
        "Age": "5 years",
        "Size": "40.03 MB",
        "Seeders": "25",
        "Leechers": "16777215"
      },
      {
        "Title": {
          "href": "https://oldpiratebay.org/torrent/3261189/Coraline-2009-DvdScreener-0725-wmv",
          "text": "Coraline[2009][DvdScreener]0725.wmv"
        },
        "Type": {
          "href": "https://oldpiratebay.org/search.php?iht=5&age=0",
          "text": "Movies"
        },
        "Age": "5 years",
        "Size": "703.93 MB",
        "Seeders": "0",
        "Leechers": "16777215"
      },
      {
        "Title": {
          "href": "https://oldpiratebay.org/torrent/3557711/Bakugan-New-Vestroia-28-Revenge-of-the-Vexos-DW-Umi-avi",
          "text": "Bakugan New Vestroia - 28 - Revenge of the Vexos (DW-Umi) avi"
        },
        "Type": {
          "href": "https://oldpiratebay.org/search.php?iht=1&age=0",
          "text": "Anime"
        },
        "Age": "5 years",
        "Size": "175.19 MB",
        "Seeders": "35",
        "Leechers": "16777215"
      },
      {
        "Title": {
          "href": "https://oldpiratebay.org/torrent/4922068/Akuma-Fate-Subs-Mahou-Shoujo-Lyrical-Nanoha-The-Movie-1st-1920x1080-x264-DTS-Sub-Ita-mkv",
          "text": "[Akuma & Fate-Subs] Mahou Shoujo Lyrical Nanoha The Movie 1st (1920x1080 x264 DTS Sub Ita).mkv"
        },
        "Type": {
          "href": "https://oldpiratebay.org/search.php?iht=1&age=0",
          "text": "Anime"
        },
        "Age": "4 years",
        "Size": "4.38 GB",
        "Seeders": "16777215",
        "Leechers": "16777215"
      },
      {
        "Title": {
          "href": "https://oldpiratebay.org/torrent/5841074/Linux-Mint-13-KDE-64-bit",
          "text": "Linux Mint 13 KDE (64-bit)"
        },
        "Type": {
          "href": "https://oldpiratebay.org/search.php?iht=2&age=0",
          "text": "Software"
        },
        "Age": "2 years",
        "Size": "892 MB",
        "Seeders": "65",
        "Leechers": "16777215"
      },
      {
        "Title": {
          "href": "https://oldpiratebay.org/torrent/5841075/Linux-Mint-13-KDE-32-bit",
          "text": "Linux Mint 13 KDE (32-bit)"
        },
        "Type": {
          "href": "https://oldpiratebay.org/search.php?iht=2&age=0",
          "text": "Software"
        },
        "Age": "2 years",
        "Size": "915 MB",
        "Seeders": "77",
        "Leechers": "16777215"
      },
      {
        "Title": {
          "href": "https://oldpiratebay.org/torrent/5993736/Novicorp-WinToFlash-0-7-0009-beta",
          "text": "Novicorp WinToFlash 0.7.0009 beta"
        },
        "Type": {
          "href": "https://oldpiratebay.org/search.php?iht=2&age=0",
          "text": "Software"
        },
        "Age": "2 years",
        "Size": "6.84 MB",
        "Seeders": "1534",
        "Leechers": "16777215"
      },
      {
        "Title": {
          "href": "https://oldpiratebay.org/torrent/5998919/linux-mint-14-cinnamon-64-bit",
          "text": "linux mint 14 cinnamon [64 bit]"
        },
        "Type": {
          "href": "https://oldpiratebay.org/search.php?iht=2&age=0",
          "text": "Software"
        },
        "Age": "2 years",
        "Size": "881 MB",
        "Seeders": "38",
        "Leechers": "16777215"
      },
      {
        "Title": {
          "href": "https://oldpiratebay.org/torrent/5998920/linux-mint-14-cinnamon-32-bit",
          "text": "linux mint 14 cinnamon [32 bit]"
        },
        "Type": {
          "href": "https://oldpiratebay.org/search.php?iht=2&age=0",
          "text": "Software"
        },
        "Age": "2 years",
        "Size": "884 MB",
        "Seeders": "22",
        "Leechers": "16777215"
      },
      {
        "Title": {
          "href": "https://oldpiratebay.org/torrent/5998922/linux-mint-14-mate-64-bit",
          "text": "linux mint 14 mate [64 bit]"
        },
        "Type": {
          "href": "https://oldpiratebay.org/search.php?iht=2&age=0",
          "text": "Software"
        },
        "Age": "2 years",
        "Size": "968 MB",
        "Seeders": "35",
        "Leechers": "16777215"
      },
      {
        "Title": {
          "href": "https://oldpiratebay.org/torrent/5998924/Linux-Mint-14-mate-32-bit",
          "text": "Linux Mint 14 mate [32 bit]"
        },
        "Type": {
          "href": "https://oldpiratebay.org/search.php?iht=2&age=0",
          "text": "Software"
        },
        "Age": "2 years",
        "Size": "966 MB",
        "Seeders": "25",
        "Leechers": "16777215"
      },
      {
        "Title": {
          "href": "https://oldpiratebay.org/torrent/6044126/Rise-of-the-Guardians-2012-DvdRip-Xvid-UnKnOwN-FR-SUB",
          "text": "Rise of the Guardians 2012} DvdRip Xvid UnKnOwN [FR-SUB]"
        },
        "Type": {
          "href": "https://oldpiratebay.org/search.php?iht=5&age=0",
          "text": "Movies"
        },
        "Age": "2 years",
        "Size": "702.2 MB",
        "Seeders": "0",
        "Leechers": "16777215"
      },
      {
        "Title": {
          "href": "https://oldpiratebay.org/torrent/6097316/linuxmint-14-xfce-dvd-32-bit-ISO",
          "text": "linuxmint 14 xfce dvd 32-bit ISO"
        },
        "Type": {
          "href": "https://oldpiratebay.org/search.php?iht=2&age=0",
          "text": "Software"
        },
        "Age": "2 years",
        "Size": "872 MB",
        "Seeders": "106",
        "Leechers": "16777215"
      },
      {
        "Title": {
          "href": "https://oldpiratebay.org/torrent/6099470/Pinguy-OS-12-04-shell-i686-iso",
          "text": "Pinguy OS 12 04 shell i686 iso"
        },
        "Type": {
          "href": "https://oldpiratebay.org/search.php?iht=2&age=0",
          "text": "Software"
        },
        "Age": "2 years",
        "Size": "1.65 GB",
        "Seeders": "16777215",
        "Leechers": "16777215"
      },
      {
        "Title": {
          "href": "https://oldpiratebay.org/torrent/6447434/Kevin-Hearne-Hunted-The-Iron-Druid-Chronicles-Book-Six",
          "text": "Kevin Hearne - Hunted (The Iron Druid Chronicles, Book Six)"
        },
        "Type": {
          "href": "https://oldpiratebay.org/search.php?iht=9&age=0",
          "text": "Books"
        },
        "Age": "1 year",
        "Size": "2.47 MB",
        "Seeders": "239",
        "Leechers": "16777215"
      },
      {
        "Title": {
          "href": "https://oldpiratebay.org/torrent/6760382/linuxmint-201109-gnome-dvd-64bit-iso",
          "text": "linuxmint-201109-gnome-dvd-64bit.iso"
        },
        "Type": {
          "href": "https://oldpiratebay.org/search.php?iht=2&age=0",
          "text": "Software"
        },
        "Age": "1 year",
        "Size": "1.1 GB",
        "Seeders": "18",
        "Leechers": "16777215"
      },
      {
        "Title": {
          "href": "https://oldpiratebay.org/torrent/11243775/TUcaptions-JP-Inari-Konkon-Koi-Iroha-01-10-v2-BIG5-TV-720P-MP4",
          "text": "[TUcaptions-JP][Inari Konkon Koi Iroha][01~10][v2][BIG5][TV-720P-MP4]"
        },
        "Type": {
          "href": "https://oldpiratebay.org/search.php?iht=1&age=0",
          "text": "Anime"
        },
        "Age": "9 months",
        "Size": "1.95 GB",
        "Seeders": "19",
        "Leechers": "16777215"
      },
      {
        "Title": {
          "href": "https://oldpiratebay.org/torrent/11254039/TUcaptions-3C-Mikakunin-de-Shinkoukei-01-12-BIG5-TV-1080P-MKV",
          "text": "[TUcaptions-3C][Mikakunin de Shinkoukei][01~12][BIG5][TV-1080P-MKV]"
        },
        "Type": {
          "href": "https://oldpiratebay.org/search.php?iht=8&age=0",
          "text": "Series & tv"
        },
        "Age": "9 months",
        "Size": "5.49 GB",
        "Seeders": "16806",
        "Leechers": "16777215"
      },
      {
        "Title": {
          "href": "https://oldpiratebay.org/torrent/11266331/TUcaptions-3C-Mikakunin-de-Shinkoukei-01-12-END-final-TV-720P-BIG5",
          "text": "[TUcaptions-3C][Mikakunin de Shinkoukei][01~12(END)][final][TV-720P][BIG5]"
        },
        "Type": {
          "href": "https://oldpiratebay.org/search.php?iht=1&age=0",
          "text": "Anime"
        },
        "Age": "9 months",
        "Size": "2.07 GB",
        "Seeders": "86",
        "Leechers": "16777215"
      },
      {
        "Title": {
          "href": "https://oldpiratebay.org/torrent/12946431/David-Letterman-2008-07-30-Pamela-Anderson-HDTV-XVID-BAJSKORV",
          "text": "David Letterman 2008 07 30 Pamela Anderson HDTV XVID-BAJSKORV"
        },
        "Type": {
          "href": "https://oldpiratebay.org/search.php?iht=8&age=0",
          "text": "Series & tv"
        },
        "Age": "2 months",
        "Size": "350.03 MB",
        "Seeders": "1",
        "Leechers": "16777215"
      },
      {
        "Title": {
          "href": "https://oldpiratebay.org/torrent/6467513/V-For-Vendetta",
          "text": "V For Vendetta"
        },
        "Type": {
          "href": "https://oldpiratebay.org/search.php?iht=5&age=0",
          "text": "Movies"
        },
        "Age": "1 year",
        "Size": "708.19 MB",
        "Seeders": "2293236",
        "Leechers": "8613246"
      },
      {
        "Title": {
          "href": "https://oldpiratebay.org/torrent/3171744/knowing-2009-DVDRip-FXG-FxM-XviD-aXXo",
          "text": "knowing 2009 DVDRip FXG FxM XviD-aXXo"
        },
        "Type": {
          "href": "https://oldpiratebay.org/search.php?iht=5&age=0",
          "text": "Movies"
        },
        "Age": "5 years",
        "Size": "750.12 KB",
        "Seeders": "9444904",
        "Leechers": "5543610"
      },
      {
        "Title": {
          "href": "https://oldpiratebay.org/torrent/3169665/Up-2009-DVDRip-FXG-FxM-XviD-aXXo",
          "text": "Up 2009 DVDRip FXG FxM XviD-aXXo"
        },
        "Type": {
          "href": "https://oldpiratebay.org/search.php?iht=5&age=0",
          "text": "Movies"
        },
        "Age": "5 years",
        "Size": "775.25 KB",
        "Seeders": "7983485",
        "Leechers": "5330660"
      },
      {
        "Title": {
          "href": "https://oldpiratebay.org/torrent/3174573/knowing-2009-DVDRip-FXG-FxM-XviD-aXXo",
          "text": "knowing 2009 DVDRip FXG FxM XviD-aXXo"
        },
        "Type": {
          "href": "https://oldpiratebay.org/search.php?iht=5&age=0",
          "text": "Movies"
        },
        "Age": "5 years",
        "Size": "767.46 KB",
        "Seeders": "9479713",
        "Leechers": "4795168"
      },
      {
        "Title": {
          "href": "https://oldpiratebay.org/torrent/3172041/confessions-of-a-shopaholic-2009-DVDRip-FXG-FxM-XviD-aXXo",
          "text": "confessions of a shopaholic 2009 DVDRip FXG FxM XviD-aXXo"
        },
        "Type": {
          "href": "https://oldpiratebay.org/search.php?iht=5&age=0",
          "text": "Movies"
        },
        "Age": "5 years",
        "Size": "757.92 KB",
        "Seeders": "8893465",
        "Leechers": "4509796"
      },
      {
        "Title": {
          "href": "https://oldpiratebay.org/torrent/3169670/Last-Chance-Harvey-2009-DVDRip-FXG-FxM-XviD-aXXo",
          "text": "Last Chance Harvey 2009 DVDRip FXG FxM XviD-aXXo"
        },
        "Type": {
          "href": "https://oldpiratebay.org/search.php?iht=5&age=0",
          "text": "Movies"
        },
        "Age": "5 years",
        "Size": "815.91 KB",
        "Seeders": "9836975",
        "Leechers": "3530824"
      },
      {
        "Title": {
          "href": "https://oldpiratebay.org/torrent/3172061/Night-at-the-Museum-Battle-of-the-Smithsonian-2009-DVDRip-FXG-FxM-XviD-aXXo",
          "text": "Night at the Museum Battle of the Smithsonian 2009 DVDRip FXG FxM XviD-aXXo"
        },
        "Type": {
          "href": "https://oldpiratebay.org/search.php?iht=5&age=0",
          "text": "Movies"
        },
        "Age": "5 years",
        "Size": "750.12 KB",
        "Seeders": "7460168",
        "Leechers": "3309542"
      },
      {
        "Title": {
          "href": "https://oldpiratebay.org/torrent/3172030/gran-torino-2009-DVDRip-FXG-FxM-XviD-aXXo",
          "text": "gran torino 2009 DVDRip FXG FxM XviD-aXXo"
        },
        "Type": {
          "href": "https://oldpiratebay.org/search.php?iht=5&age=0",
          "text": "Movies"
        },
        "Age": "5 years",
        "Size": "775.25 KB",
        "Seeders": "8006257",
        "Leechers": "3028472"
      },
      {
        "Title": {
          "href": "https://oldpiratebay.org/torrent/3842331/Avatar-2009-PROPER-REPACK-BDrip-XviD-ORC",
          "text": "Avatar.2009.PROPER.REPACK.BDrip.XviD-ORC"
        },
        "Type": {
          "href": "https://oldpiratebay.org/search.php?iht=5&age=0",
          "text": "Movies"
        },
        "Age": "4 years",
        "Size": "728.44 MB",
        "Seeders": "2305838",
        "Leechers": "2360813"
      },
      {
        "Title": {
          "href": "https://oldpiratebay.org/torrent/3730500/Avatar-2009-PROPER-REPACK-BDrip-XviD-ORC",
          "text": "Avatar.[2009].PROPER.REPACK.BDrip.XviD-ORC"
        },
        "Type": {
          "href": "https://oldpiratebay.org/search.php?iht=5&age=0",
          "text": "Movies"
        },
        "Age": "4 years",
        "Size": "728.44 MB",
        "Seeders": "2099163",
        "Leechers": "2329682"
      },
      {
        "Title": {
          "href": "https://oldpiratebay.org/torrent/3838283/Green-Zone-2010-R5-LiNE-XviD-iMAGiNE",
          "text": "Green Zone 2010 R5 LiNE XviD-iMAGiNE"
        },
        "Type": {
          "href": "https://oldpiratebay.org/search.php?iht=5&age=0",
          "text": "Movies"
        },
        "Age": "4 years",
        "Size": "699.41 MB",
        "Seeders": "2318114",
        "Leechers": "2268544"
      },
      {
        "Title": {
          "href": "https://oldpiratebay.org/torrent/3726526/Avatar-2009-PROPER-REPACK-BDrip-XviD-ORC",
          "text": "Avatar.[2009].PROPER.REPACK.BDrip.XviD-ORC"
        },
        "Type": {
          "href": "https://oldpiratebay.org/search.php?iht=5&age=0",
          "text": "Movies"
        },
        "Age": "4 years",
        "Size": "728.44 MB",
        "Seeders": "2046653",
        "Leechers": "2254780"
      },
      {
        "Title": {
          "href": "https://oldpiratebay.org/torrent/3729131/Avatar-2009-PROPER-REPACK-BDrip-XviD-ORC",
          "text": "Avatar.2009.PROPER.REPACK.BDrip.XviD-ORC"
        },
        "Type": {
          "href": "https://oldpiratebay.org/search.php?iht=5&age=0",
          "text": "Movies"
        },
        "Age": "4 years",
        "Size": "728.44 MB",
        "Seeders": "2108670",
        "Leechers": "2133497"
      },
      {
        "Title": {
          "href": "https://oldpiratebay.org/torrent/3838341/Valentines-Day-DVDrip-XviD-iMBT",
          "text": "Valentines Day DVDrip XviD-iMBT"
        },
        "Type": {
          "href": "https://oldpiratebay.org/search.php?iht=5&age=0",
          "text": "Movies"
        },
        "Age": "4 years",
        "Size": "726.26 MB",
        "Seeders": "2063084",
        "Leechers": "2119077"
      },
      {
        "Title": {
          "href": "https://oldpiratebay.org/torrent/3838279/Avatar-2009-REPACK-BDrip-XviD-ORC",
          "text": "Avatar.2009.REPACK.BDrip.XviD-ORC"
        },
        "Type": {
          "href": "https://oldpiratebay.org/search.php?iht=5&age=0",
          "text": "Movies"
        },
        "Age": "4 years",
        "Size": "728.44 MB",
        "Seeders": "2658057",
        "Leechers": "2008296"
      },
      {
        "Title": {
          "href": "https://oldpiratebay.org/torrent/3838276/Avatar-2009-PROPER-REPACK-BDrip-XviD-ORC",
          "text": "Avatar.[2009].PROPER.REPACK.BDrip.XviD-ORC"
        },
        "Type": {
          "href": "https://oldpiratebay.org/search.php?iht=5&age=0",
          "text": "Movies"
        },
        "Age": "4 years",
        "Size": "728.44 MB",
        "Seeders": "2000985",
        "Leechers": "1902610"
      },
      {
        "Title": {
          "href": "https://oldpiratebay.org/torrent/3840253/Avatar-2009-PROPER-BDrip-XviD-ORC",
          "text": "Avatar.2009.PROPER.BDrip.XviD-ORC"
        },
        "Type": {
          "href": "https://oldpiratebay.org/search.php?iht=5&age=0",
          "text": "Movies"
        },
        "Age": "4 years",
        "Size": "728.44 MB",
        "Seeders": "2464305",
        "Leechers": "1893080"
      }
    ]
  }
}
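
To get from this raw output to "which categories are popular", the rows could be grouped by their Type text. The sketch below is one way to do that, under a loud assumption: the value 16777215, which fills most of the Leechers column, equals 2**24 - 1, so it is treated here as a filler value rather than a real count. The function names are illustrative, and the key names are the ones that appear in the output above.

```python
import json
from collections import Counter

PLACEHOLDER = 16777215  # 2**24 - 1; assumed to be a filler value, not a real count


def to_count(value):
    """Convert a scraped count string to int, or None for the assumed placeholder."""
    number = int(value)
    return None if number == PLACEHOLDER else number


def summarize(raw_json):
    """Return (torrents per category, total known seeders per category)."""
    rows = json.loads(raw_json)["results"]["collection1"]
    per_type = Counter()
    seeders = Counter()
    for row in rows:
        category = row["Type"]["text"]
        per_type[category] += 1
        # Placeholder counts contribute nothing to the total
        seeders[category] += to_count(row["Seeders"]) or 0
    return per_type, seeders
```

Running `summarize` on each weekly result and keeping the date alongside it would give the per-category time series that the rise-and-fall idea needs.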

Zach Rispoli

27 Jan 2015

I chose to collect alchemical texts from a rather unorganized and outdated-looking website simply titled “The Alchemy Website”. This website hosts several hundred alchemical texts (manuscripts, poetry, etc.) but lacks a proper database/API with which to fetch any of the data… Plus, the links to the texts sprawl across pages, with some links taking you to tables of contents filled with more links, with each page sometimes having a different layout. Some links were even broken.

There’s some very interesting and hard-to-find material in plain text on this website, and it would be a shame if it were lost, so I wrote a very hacked-together text scraper that’s supposed to retrieve the raw alchemy texts from the site. It works fairly well (though there is a lot of room for improvement), and I was able to gather a good 2 MB of decent text.

I’m not sure what I want to do with this data (it’s a very strange set of data) but as a small proof of concept to show that this data isn’t completely useless, I threw together a Twitter bot that periodically tweets random words of wisdom from the data:

@alchemistadvice

(ok so this data is like pretty useless…not sure if i should continue with it or find something else to use)

Here’s the very hacky could-be-better scraper:

from bs4 import BeautifulSoup
import re
import requests

TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    return TAG_RE.sub('', text)

def parsePage(pageName):
    print("downloading text from " + pageName)

    currentPage = 'http://www.alchemywebsite.com/' + pageName
    pageRequest = requests.get(currentPage)
    pageData = pageRequest.text

    pageData = pageData.split('\n')[1]
    pageData = pageData.split('\n')[0]

    pageDataSoup = BeautifulSoup(pageData, 'html.parser')
    pageLinks = pageDataSoup.find_all('a')

    if len(pageLinks) > 0:
        # A page with links is treated as a table of contents;
        # for now its links are only printed
        for p in pageLinks:
            #parsePage(pageName)
            print(p)
    else:
        # A page without links is treated as a text: strip the markup and save it
        pageData = remove_tags(pageData)
        pageData = pageData.replace("&nbsp;", "")
        pageData = pageData.encode('ascii', errors='ignore').decode('ascii')
        pageData = pageData.replace('\n', ' ').replace('\r', ' ')

        text_file = open("texts/" + pageName + ".txt", "w")
        text_file.write(pageData)
        text_file.close()

#assert(False)

r = requests.get("http://www.alchemywebsite.com/texts_16th.html")
data = r.text
soup = BeautifulSoup(data, 'html.parser')

#links = soup.find_all('a')
#for i in range(60,124+1):
#    print i
#    print links[i].get('href')
#assert(False)

# Only links 60 through 124 of the index page are followed
links = soup.find_all('a')
for i in range(60, 124+1):
    pageName = links[i].get('href')
    parsePage(pageName)
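
One possible improvement for the nested tables of contents and broken links would be to follow links with an explicit work list and a set of visited pages. The sketch below is not the scraper above, just a standard-library illustration of that idea. It keeps the same rough heuristic (a page with links is a table of contents, a page without links is a text), and `crawl`, `LinkCollector`, `fetch`, `site_root`, and `max_pages` are all names made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url, fetch, site_root, max_pages=500):
    """Walk pages under site_root and return {url: html} for link-free pages."""
    seen, pending, texts = set(), [start_url], {}
    while pending and len(seen) < max_pages:
        url = pending.pop()
        if url in seen:
            continue
        seen.add(url)
        try:
            html = fetch(url)
        except OSError:
            continue  # broken link: skip it and keep going
        collector = LinkCollector()
        collector.feed(html)
        if collector.links:
            # Table-of-contents page: queue its links instead of saving it
            for href in collector.links:
                target = urljoin(url, href)
                if target.startswith(site_root):
                    pending.append(target)
        else:
            texts[url] = html
    return texts
```

Here `fetch` is any function that takes a URL and returns the page as a string. With requests, it could call `requests.get(url)`, then `raise_for_status()`, and return `.text`; requests exceptions subclass `IOError`, so a dead link would simply be skipped.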

And here’s the code for the Twitter bot:

import com.temboo.core.*;
import com.temboo.Library.Twitter.Tweets.*;
import java.io.File;
import java.io.FilenameFilter;

TembooSession session = new TembooSession("xxxxx", "myFirstApp", "xxxxx");

void setup() {
  while(true) {
    String result = "";
    
    while(result.length() < 100 || result.length() > 140) {
      File f = new File("/Users/zachrispoli/desktop/alchemistadvice/texts/");
      
      String[] files = f.list(new FilenameFilter() {
        public boolean accept(File dir, String name) {
          return name.toLowerCase().endsWith(".txt");
        }
      });
      int textIndex = int(random(files.length));
      String fileToReadFilename = files[textIndex];
      
      println("Reading from: " + fileToReadFilename);
      
      String lines[] = loadStrings("/Users/zachrispoli/desktop/alchemistadvice/texts/"+fileToReadFilename);
      String line = lines[0];
      
      //println(line);
      
      String sentences[] = line.split("\\.");
      int sentenceIndex = int(random(sentences.length));
      result = sentences[sentenceIndex] + ".";
      println(result);
    }
    runStatusesUpdateChoreo(result);
    
    delay(30000);
  }
}

void runStatusesUpdateChoreo(String tweet) {
  StatusesUpdate statusesUpdateChoreo = new StatusesUpdate(session);

  statusesUpdateChoreo.setAccessToken("xxxxx");
  statusesUpdateChoreo.setAccessTokenSecret("xxxxx");
  statusesUpdateChoreo.setConsumerSecret("xxxxx");
  statusesUpdateChoreo.setStatusUpdate(tweet);
  statusesUpdateChoreo.setConsumerKey("xxxxx");

  StatusesUpdateResultSet statusesUpdateResults = statusesUpdateChoreo.run();
  
  //println(statusesUpdateResults.getResponse());
}
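
The bot keeps drawing random sentences until one happens to land between 100 and 140 characters. An alternative, sketched here in Python rather than Processing, is to filter the sentences by length first and then pick once. `pick_sentence` and its parameters are hypothetical names, and unlike the loop above it trims whitespace around each sentence before measuring it.

```python
import random


def pick_sentence(text, min_len=100, max_len=140, rng=random):
    """Return a random sentence of tweetable length, or None if there is none."""
    # Split on periods and put the period back, as the bot does
    sentences = [part.strip() + "." for part in text.split(".") if part.strip()]
    candidates = [s for s in sentences if min_len <= len(s) <= max_len]
    return rng.choice(candidates) if candidates else None
```

Returning None when nothing fits lets the caller move on to another file instead of looping forever on a text with no suitable sentence.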