Welcome to MLink Developer Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - What are the best approaches to remove similar but not equal lines from a CSV file?

Suppose I have a CSV file like this:

GENE,01,07,Et fecit Deus expansionem: et divisit aquas quae erant sub,,,,
GENE,01,07,"expansione, ab a",,,,
GENE,01,07,"expansione, ab aquis quae erant super expansionem. Et fuit ita.",,,,
GENE,01,08,"Vocavitque Deus expansionem, Caelum. Et fuit vespera, et fuit mane",,,,
GENE,01,08,dies secundus.,,,,

Line 2 is an error where the text was incorrectly truncated. The file has around 53K lines. For lines where the first three columns are equal and the first n characters of column four are equal, I'd like to keep the line where len(column4) is greater.

I've been doing the pre-processing in Python 3, but I'm open to other approaches.



1 Answer


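One way to do this in pandas is to group by the first three columns plus the first 4 characters of col4, then keep the row whose col4 is longest in each group. Select the rows with idxmax() on the length rather than calling .max() on the groups: .max() takes each column's maximum independently and compares strings alphabetically, not by length. This assumes the file is already loaded into df with the first four columns named col1 to col4.

    # Length of col4 and its first 4 characters (the grouping prefix)
    length = df["col4"].str.len()
    prefix = df["col4"].str[:4].rename("prefix")
    # Index label of the longest col4 within each (col1, col2, col3, prefix) group
    keep = length.groupby([df["col1"], df["col2"], df["col3"], prefix]).idxmax()
    df1 = df.loc[keep].sort_index()

I have used the first 4 characters here; change str[:4] to match the prefix length you need.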
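The same rule can also be applied without pandas. The following is a minimal sketch using only the standard library, run on the sample rows from the question. The function name `drop_truncated` and the parameter `n` are hypothetical names chosen for illustration, and it assumes the file has no header row and that every useful line has at least four columns.

```python
import csv
import io


def drop_truncated(rows, n=4):
    """For each (col1, col2, col3, first n chars of col4) group,
    keep only the row whose col4 is longest."""
    best = {}  # group key -> best row seen so far
    for row in rows:
        if len(row) < 4:
            continue  # assumption: blank or malformed lines are skipped
        key = (row[0], row[1], row[2], row[3][:n])
        # Replace the stored row only if this one has a strictly longer col4.
        if key not in best or len(row[3]) > len(best[key][3]):
            best[key] = row
    # Dicts preserve insertion order, so groups come out in the order
    # in which each group first appeared in the input.
    return list(best.values())


# Sample data copied from the question.
SAMPLE = '''GENE,01,07,Et fecit Deus expansionem: et divisit aquas quae erant sub,,,,
GENE,01,07,"expansione, ab a",,,,
GENE,01,07,"expansione, ab aquis quae erant super expansionem. Et fuit ita.",,,,
GENE,01,08,"Vocavitque Deus expansionem, Caelum. Et fuit vespera, et fuit mane",,,,
GENE,01,08,dies secundus.,,,,
'''

cleaned = drop_truncated(csv.reader(io.StringIO(SAMPLE)), n=4)
for row in cleaned:
    print(row[:4])
```

On the sample, the truncated "expansione, ab a" row shares its first three columns and its 4-character prefix with the full row, so only the longer one is kept. For the real file, pass `csv.reader(open(path, newline=""))` instead of the in-memory sample and write the result back with `csv.writer`. A small `n` can merge unrelated lines of the same verse that happen to start with the same characters, so it may be worth trying a larger value and checking which rows get dropped.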

