/wsr/ - Worksafe Requests » Thread #1277718


194KiB, 1058x1058, Wikipedia_Logo_1.0.png

Anonymous Tue 01 Nov 2022 14:54:00 No.1277718

Quoted By: >>1277735 >>1278023 >>1278175 >>1278394 >>1278487

I want to make a spreadsheet containing every person listed as an actor on Wikipedia, along with their country of origin. I've been trying to use a web scraper tool, but it turns out I'm technologically illiterate. Please help.

Anonymous Tue 01 Nov 2022 15:46:11 No.1277735

>>1277718
>a spreadsheet containing every person listed as an actor
!!

Anonymous Wed 02 Nov 2022 06:30:41 No.1278017

Quoted By: >>1278173

Is there a "list of actors" page? What specifically are you trying to copy?

Anonymous Wed 02 Nov 2022 07:31:33 No.1278023

Quoted By: >>1278173

>>1277718
IMDb might have a more extensive database of actors.

1662292881071549.webm, 937KiB, 480x480

Anonymous Wed 02 Nov 2022 09:30:49 No.1278037

Anonymous Wed 02 Nov 2022 20:45:34 No.1278173

>>1278017
Well, by my logic, I thought every notable actor on Wikipedia would have a link to an "Actor" page, so it would be possible to use some tool to find all of those pages and scan them for place of birth or something like that, idk.

>>1278023
That's clever, actually. Thanks, anon, I will try that instead.

Anonymous Wed 02 Nov 2022 20:52:31 No.1278175

>>1277718
https://m.wikidata.org/wiki/Wikidata:Main_Page
Wikidata might actually be easier to scrape.
Example page:
https://m.wikidata.org/wiki/Q2263

Anonymous Thu 03 Nov 2022 09:43:07 No.1278394

Quoted By: >>1278411 >>1278486

>>1277718
Here is a list of 14,608 actors and actresses with place of birth.
https://files.catbox.moe/fyf5cq.tsv
It was lazily scraped from the “Lists of actors” page, by only scraping that which was easy to scrape, and as such, is incomplete. Some post-processing is required, as more than country is included, and naming is not consistent (U.S. vs US vs USA vs United States).
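
The post-processing mentioned above could be sketched roughly like this in Python. This is only an illustration: it assumes the TSV has two columns (name, then birthplace) and that the country is the last comma-separated part of the birthplace. The alias table and the function names are made up for the example and would need extending for the real data.

```python
import csv
import io

# Hypothetical alias table: extend it with whatever spellings show up in the data.
COUNTRY_ALIASES = {
    "u.s.": "United States",
    "us": "United States",
    "usa": "United States",
    "united states": "United States",
    "u.k.": "United Kingdom",
    "uk": "United Kingdom",
}


def country_from_birthplace(birthplace: str) -> str:
    """Take the last comma-separated part of a birthplace and normalize its spelling."""
    last = birthplace.split(",")[-1].strip()
    return COUNTRY_ALIASES.get(last.lower(), last)


def normalize_rows(tsv_text: str):
    """Yield (name, country) pairs from two-column TSV text (assumed layout)."""
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        yield row[0], country_from_birthplace(row[1])
```

Anything not in the alias table passes through unchanged, so unusual spellings stay visible for manual cleanup instead of being silently guessed.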

Anonymous Thu 03 Nov 2022 11:13:30 No.1278411

Quoted By: >>1278424 >>1278485

>>1278394
Very nice work! What did you use, Beautiful Soup?

Anonymous Thu 03 Nov 2022 12:15:55 No.1278424

>>1278411
Yes, BeautifulSoup.

1572074217743.jpg, 44KiB, 500x500

Anonymous Thu 03 Nov 2022 15:45:34 No.1278485

Quoted By: >>1278486 >>1278496

>>1278411
OP here, thank you anon that is exactly what I was after.

If it's not too much trouble, could you explain exactly how you went about doing that and what sort of skills I would need? I'm just starting to get into data analytics and programming, so it would be nice to get a general idea.

Anonymous Thu 03 Nov 2022 15:46:35 No.1278486

>>1278485
meant for >>1278394

Anonymous Thu 03 Nov 2022 15:59:06 No.1278487

Quoted By: >>1278501

>>1277718
Didn't see this thread until now.
You can use this for exactly what you want, as long as the entries are tagged properly.
https://query.wikidata.org/
It's kinda clunky, but it can output exactly what you want if you figure out the query. No need to scrape the entirety of Wikipedia or whatever this thread seems to be doing.

Anonymous Thu 03 Nov 2022 16:24:16 No.1278496

Quoted By: >>1278522 >>1278995

>>1278485
I started by searching for “list of actors” on Wikipedia, and found
https://en.wikipedia.org/wiki/Lists_of_actors.
It’s a list of lists, so I wanted to find all lists, but not those under “See also”. I inspected some of those lists, and saw that many, e.g.,
https://en.wikipedia.org/wiki/List_of_British_actors,
shared a structure. There are other types of pages, but I ignored those because it would be a lot of work not to. Many of the articles have an infobox with a birthplace div, so I extracted the birthplace from there and ignored pages that lack this element. I extracted the name from the title, the first h1 element, which in hindsight is not the best way of doing it. I did all of this using Python with BeautifulSoup for the scraping, and Firefox’s developer tools to find the elements I needed. I did a little post-processing with common Unix tools and Emacs. The code is here:
https://pastebin.com/79DMMy5Q
Be advised that this code is messy and inefficient, as it was written hastily with little forethought.
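
For reference, the per-article step described above (name from the first h1, birthplace from the infobox's birthplace element) can be sketched as follows. This is not the pastebin code. To keep the example dependency-free it swaps BeautifulSoup for the standard library's `html.parser`, and it assumes the birthplace element carries the CSS class `birthplace`. The class and function names are made up for illustration.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "wbr", "input", "meta", "link"}


class ActorPageParser(HTMLParser):
    """Collect the text of the first <h1> and of the first element with class "birthplace"."""

    def __init__(self):
        super().__init__()
        self.name = ""
        self.birthplace = ""
        self._target = None  # field currently being filled, if any
        self._depth = 0      # nesting depth inside the captured element
        self._done = set()   # fields that have already been captured

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._target:
            self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h1" and "name" not in self._done:
            self._target, self._depth = "name", 1
        elif "birthplace" in classes and "birthplace" not in self._done:
            self._target, self._depth = "birthplace", 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._target:
            return
        self._depth -= 1
        if self._depth == 0:
            self._done.add(self._target)
            self._target = None

    def handle_data(self, data):
        if self._target:
            setattr(self, self._target, getattr(self, self._target) + data)


def extract_actor(html: str):
    """Return (name, birthplace), or None when the page has no birthplace element."""
    parser = ActorPageParser()
    parser.feed(html)
    parser.close()
    name = " ".join(parser.name.split())
    birthplace = " ".join(parser.birthplace.split())
    return (name, birthplace) if birthplace else None
```

Returning None for pages without the element mirrors the "ignore pages that lack this element" decision. Downloading each article is left out here and would need polite rate limiting.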

Anonymous Thu 03 Nov 2022 16:50:50 No.1278501

Quoted By: >>1278522 >>1278526

>>1278487
https://w.wiki/5uXs
This is what the query should be (minus the LIMIT 100), but it times out. It will most likely work if you run each country separately.

Actually, there is no timeout if you skip the labeling.
https://w.wiki/5uXv
But now you gotta find out who the 234,795 actors are, since without labels the results are just item IDs.

Either way, very close to what you need... someone else can tinker with it
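
For anyone who wants to tinker from Python instead of the web form, here is a rough sketch. The short links are opaque, so this query is a guess at the general shape, not a copy of the one above. It uses the Wikidata identifiers P106 (occupation), Q33999 (actor), P19 (place of birth) and P17 (country); the function names and the User-Agent string are made up for the example.

```python
import json
import urllib.parse
import urllib.request

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"


def build_actor_query(limit=None):
    """Return a SPARQL query for actors and the country of their place of birth."""
    query = (
        "SELECT DISTINCT ?actor ?country WHERE {\n"
        "  ?actor wdt:P106 wd:Q33999 .   # occupation: actor\n"
        "  ?actor wdt:P19 ?place .       # place of birth\n"
        "  ?place wdt:P17 ?country .     # country that place is in\n"
        "}"
    )
    if limit is not None:
        query += f"\nLIMIT {int(limit)}"
    return query


def run_query(query):
    """Send the query to the endpoint and return the result bindings (needs network)."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(
        SPARQL_ENDPOINT + "?" + params,
        headers={"User-Agent": "actor-country-example/0.1"},  # placeholder value
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["results"]["bindings"]
```

Start with something like `run_query(build_actor_query(limit=100))` to check the shape of the results before dropping the limit, since the full query may well time out as described above.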

Anonymous Thu 03 Nov 2022 18:19:31 No.1278522

Quoted By: >>1278526 >>1278570

>>1278501
This is really neat, thanks for sharing it with OP and us. For cases where the data is public and well maintained by somebody else for you, this is probably the best approach.
I still think >>1278496 is valuable for OP though, because that's how you have to do it most of the time in the "real world", when the data hasn't been nicely prepared beforehand. With Copilot nowadays it's also really simple to generate a bunch of BeautifulSoup boilerplate code quickly (one good technique is to write a comment, and then let Copilot write the function that's supposed to come after it).

.jpg, 52KiB, 750x748

Anonymous Thu 03 Nov 2022 18:35:35 No.1278526

Quoted By: >>1278527 >>1278532

>>1278522
I agree, scraping is fun.

Anyway, >>1278501 is not exactly what OP was asking for, and it led to messy duplicates. Oops.

I used https://w.wiki/5uZ4 to find birthplace instead of citizenship.
To prevent timeouts, I filtered by birth year and slid the range up until the current day.
I included both birthplace and citizenship anyway.

https://files.catbox.moe/jr7dg0.7z
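
The birth-year trick can be sketched like this: split the years into windows and build one smaller query per window. This is an illustration, not the query behind the short link. It assumes P569 (date of birth) is used for the filter, and the window size and function names are arbitrary choices for the example.

```python
def year_windows(start, end, step):
    """Split [start, end] into consecutive inclusive (first, last) year windows."""
    windows = []
    first = start
    while first <= end:
        last = min(first + step - 1, end)
        windows.append((first, last))
        first = last + 1
    return windows


def build_window_query(first, last):
    """Return a SPARQL query for actors born between first and last (inclusive)."""
    return (
        "SELECT DISTINCT ?actor ?place WHERE {\n"
        "  ?actor wdt:P106 wd:Q33999 ;   # occupation: actor\n"
        "         wdt:P19 ?place ;       # place of birth\n"
        "         wdt:P569 ?born .       # date of birth\n"
        f"  FILTER(YEAR(?born) >= {first} && YEAR(?born) <= {last})\n"
        "}"
    )


# One query per window; each one can be pasted into the query service in turn.
queries = [build_window_query(a, b) for a, b in year_windows(1900, 1924, 10)]
```

Using DISTINCT in each query avoids repeated rows within a window, but an actor with several recorded birth dates could still show up in more than one window.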

Anonymous Thu 03 Nov 2022 18:37:27 No.1278527

>>1278526
for even more scraping fun https://www.imdb.com/interfaces/ includes every actor but not birthplace, which can be scraped by visiting their bio page
i suspect it is more complete than wikipedia, which *should* only include notable people
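
A minimal sketch of picking actors out of the IMDb name dump follows. It assumes the file is tab-separated with a header row containing `nconst`, `primaryName` and a comma-separated `primaryProfession` column; that is how I remember the dataset being laid out, so check it against the documentation on the page linked above. The function name is made up for the example.

```python
import csv
import io


def actor_names(tsv_text):
    """Yield (nconst, primaryName) for rows whose professions include actor or actress."""
    reader = csv.DictReader(
        io.StringIO(tsv_text),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # treat quote characters in names as plain text
    )
    for row in reader:
        professions = (row.get("primaryProfession") or "").split(",")
        if "actor" in professions or "actress" in professions:
            yield row["nconst"], row["primaryName"]
```

For the real file you would stream the decompressed dump line by line instead of loading it into one string, since it is large.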

Anonymous Thu 03 Nov 2022 18:54:54 No.1278532

>>1278526
something funky once again, I didn't put DISTINCT in the query so there are dupes for every listed age
lol
just remove duplicate lines and that should do it...
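
Removing the duplicate lines takes only a few lines of Python. This sketch keeps the first occurrence of each line and preserves the original order; the function name is made up for the example.

```python
def unique_lines(lines):
    """Return the lines in their original order with exact duplicates dropped."""
    seen = set()
    result = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            result.append(line)
    return result
```

If the order doesn't matter, `sort -u` on the command line does the same job.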

Anonymous Thu 03 Nov 2022 21:54:33 No.1278570

>>1278522
Yeah, the Python web scraping thing is pretty much exactly what I was after. The whole actor-country thing was completely arbitrary; I just wanted something to serve as a goal for learning basic programming stuff, since I don't really know where to start. I really appreciate the help everyone, ty :)

Anonymous Sat 05 Nov 2022 00:50:20 No.1278995

Quoted By: >>1279060

>>1278496
Sorry to bother you again, but could you repost the pastebin code? It's 404ing for me and I'd like to use it as a reference.

Anonymous Sat 05 Nov 2022 03:49:07 No.1279060

>>1278995
https://pastebin.com/Gdw6ksvd
